US006374336B1 

(12) United States Patent (10) Patent No.: US 6,374,336 B1

Peters et al. (45) Date of Patent: *Apr. 16, 2002



(54) COMPUTER SYSTEM AND PROCESS FOR 
TRANSFERRING MULTIPLE HIGH 
BANDWIDTH STREAMS OF DATA 
BETWEEN MULTIPLE STORAGE UNITS 
AND MULTIPLE APPLICATIONS IN A 
SCALABLE AND RELIABLE MANNER 

(75) Inventors: Eric C. Peters, Carlisle; Stanley 
Rabinowitz, Westford, both of MA 
(US); Herbert R. Jacobs, Hudson, NH
(US); Richard Baker Gillett, Jr., 
Westford; Peter J. Fasciano, Natick, 
both of MA (US) 

(73) Assignee: Avid Technology, Inc., Tewksbury, MA 
(US) 

( * ) Notice: Subject to any disclaimer, the term of this 
patent is extended or adjusted under 35 
U.S.C. 154(b) by 0 days. 

This patent is subject to a terminal dis- 
claimer. 

(21) Appl. No.: 09/054,761 

(22) Filed: Apr. 3, 1998 

Related U.S. Application Data 

(63) Continuation of application No. 09/006,070, filed on Jan. 12, 
1998, now abandoned, and a continuation of application No. 
08/997,769, filed on Dec. 24, 1997, now abandoned. 

(51) Int. Cl.7 G06F 12/00; G06F 13/372

(52) U.S. Cl. 711/167; 711/114; 714/6;
709/233; 725/92; 725/93; 707/205

(58) Field of Search 711/170, 100,
711/4, 112, 114, 133, 162, 167, 169; 709/102,
104, 105, 225, 226, 230, 231, 232, 233,
240, 217, 219; 714/6-8, 18; 348/7; 725/92,
93, 97, 101; 707/104, 200, 201, 205

(56) References Cited 

U.S. PATENT DOCUMENTS 
4,887,204 A 12/1989 Johnson et al. 707/10
5,262,875 A 11/1993 Mincer et al. 386/101

(List continued on next page.) 

FOREIGN PATENT DOCUMENTS 

EP 0 701 198 A1 3/1996

EP 0 740 247 A2 10/1996 

(List continued on next page.) 

OTHER PUBLICATIONS 

Sreenivas Gollapudi et al., "Net Media: A Client-Server
Distributed Multimedia Database Environment," Apr., 1996, 
University at Buffalo, Department of Computer Science, pp. 
1-17. 

(List continued on next page.) 

Primary Examiner — Matthew Kim 
Assistant Examiner — Pierre-Michel Bataille 
(74) Attorney, Agent, or Firm — Peter J. Gordon 

(57) ABSTRACT 

Multiple applications request data from multiple storage 
units over a computer network. The data is divided into 
segments and each segment is distributed randomly on one 
of several storage units, independent of the storage units on 
which other segments of the media data are stored. Redun- 
dancy information corresponding to each segment also is 
distributed randomly over the storage units. The redundancy 
information for a segment may be a copy of the segment, 
such that each segment is stored on at least two storage units. 
The redundancy information also may be based on two or 
more segments. This random distribution of segments of 
data and corresponding redundancy information improves 
both scalability and reliability. When a storage unit fails, its 
load is distributed evenly over the remaining storage units
and its lost data may be recovered because of the redundancy 
information. When an application requests a selected seg- 
ment of data, the request may be processed by the storage 
unit with the shortest queue of requests. Random fluctua- 
tions in the load applied by multiple applications on multiple 
storage units are balanced nearly equally over all of the 
storage units. This combination of techniques results in a 
system which can transfer multiple, independent high- 
bandwidth streams of data in a scalable manner in both 
directions between multiple applications and multiple stor- 
age units. 

COMPUTER SYSTEM AND PROCESS FOR
TRANSFERRING MULTIPLE HIGH 
BANDWIDTH STREAMS OF DATA 
BETWEEN MULTIPLE STORAGE UNITS 
AND MULTIPLE APPLICATIONS IN A 
SCALABLE AND RELIABLE MANNER 

CROSS-REFERENCE TO RELATED 
APPLICATIONS 

This application claims the benefit under 35 U.S.C. § 120, 
and is a continuing application, of U.S. patent application 
Ser. No. 09/006,070, filed on Jan. 12, 1998, and U.S. patent 
application Ser. No. 08/997,769, filed Dec. 24, 1997, now 
abandoned. 

FIELD OF THE INVENTION 

The present invention is related to computer systems for 
capture, authoring and playback of multimedia programs 
and to distributed computing systems.

BACKGROUND 

There are several computer system architectures which 
support distributed use of data over computer networks. 
These computer system architectures are used in applica- 
tions such as corporate intranets, distributed database appli- 
cations and video-on-demand services. 

Video-on-demand services, for example, typically are 
designed with an assumption that a user requests an entire 
movie, and that the selected movie has a substantial length. 
The video-on-demand server therefore is designed to sup- 
port read-only access by several subscribers to the same 
movie, possibly at different times. Such servers generally 
divide data into several segments and distribute the seg- 
ments sequentially over several computers or computer 
disks. This technique commonly is called striping, and is 
described, for example, in U.S. Pat. Nos. 5,473,362, 5,583, 
868 and 5,610,841. One problem with striping data for 
movies over several disks is that failure of one disk or server 
can result in the loss of all movies, because every movie has 
at least one segment written on every disk. 

A common technique for providing reliability in data 
storage is called mirroring. A hybrid system using mirroring 
and sequential striping is shown in U.S. Pat. No. 5,559,764
(Chen et al.). Mirroring involves maintaining two copies of 
each storage unit, i.e., having a primary storage and sec- 
ondary backup storage for all data. Both copies also may be 
used for load distribution. Using this technique however, a 
failure of the primary storage causes its entire load to be 
placed on the secondary backup storage. 

Another problem with sequentially striping data over 
several disks is the increased likelihood of what is called a 
"convoy effect." A convoy effect occurs because requests for 
data segments from a file tend to group together at a disk and 
then cycle from one disk to the next (a "convoy"). As a 
result, one disk may be particularly burdened with requests 
at one time while other disks have a light load. Any new
requests to a disk also must wait for the convoy to be 
processed, thus resulting in increased latency for new 
requests. In order to overcome the convoy effect, data may 
be striped in a random fashion, i.e., segments of a data file
are stored in a random order among the disks rather than
sequentially. Such a system is described in "Design and
Performance Tradeoffs in Clustered Video Servers," by R.
Tewari, et al., in Proceedings of Multimedia '96, pp.
144-150. Such a system still may experience random,
extreme loads on one disk, however, due to the generally
random nature of data accesses.

None of these systems is individually capable of transferring
multiple, independent, high bandwidth streams of data,
particularly isochronous media data such as video and
associated audio data, between multiple storage units and
multiple applications in a scalable and reliable manner. Such
data transfer requirements are particularly difficult in
systems supporting capture, authoring and playback of
multimedia data. In an authoring system in particular, data
typically is accessed in small fragments, called clips, of
larger data files. These clips tend to be accessed in an
arbitrary or random order with respect to how the data is
stored, making efficient data transfer difficult to achieve.

SUMMARY 

Data is randomly distributed on multiple storage units
connected with multiple applications using a computer
network. The data is divided into segments. Each segment is
stored on one of the storage units. Redundancy information
based on one or more segments also is stored on a different
storage unit than the segments on which it is based. The
redundancy information may be a copy of each segment or
may be computed by an exclusive-or operation performed
on two or more segments. The selection of each storage unit
on which a segment or redundancy information is stored is
random or pseudorandom and may be independent of the
storage units on which other segments of the data are stored.
Where redundancy information is based on two or more
segments, each of the segments is stored on a different
storage unit.
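
As a minimal sketch of the exclusive-or form of redundancy described above, the following Python fragment places two segments and their exclusive-or on three distinct, randomly selected storage units and then rebuilds one segment from the other two pieces. The function names, the unit labels and the fixed seed are assumptions made only for illustration.

```python
import random


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Exclusive-or of two equal-length segments."""
    if len(a) != len(b):
        raise ValueError("segments must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def place_redundancy_set(seg_i, seg_j, units, rng):
    """Put two segments and their exclusive-or on three distinct,
    randomly chosen storage units. Returns {unit: payload}."""
    unit_i, unit_j, unit_r = rng.sample(units, 3)
    return {unit_i: seg_i, unit_j: seg_j, unit_r: xor_bytes(seg_i, seg_j)}


rng = random.Random(0)  # hypothetical seed, used only for repeatability
units = ["w", "x", "y", "z"]
seg1, seg2 = b"\x01\x02\x03\x04", b"\xf0\x0f\xaa\x55"

placement = place_redundancy_set(seg1, seg2, units, rng)
parity = xor_bytes(seg1, seg2)

# If the unit holding seg1 fails, seg1 is rebuilt from seg2 and the parity.
recovered = xor_bytes(parity, seg2)
```

Because the three storage units are drawn without replacement, loss of any single unit leaves two of the three pieces available, which is sufficient to rebuild the third.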

This random distribution of segments of data improves
both scalability and reliability. For example, because the
data is processed by accessing segments, data fragments or
clips also are processed as efficiently as all of the data. The
applications may request data transfer from a storage unit
only when that transfer would be efficient and may request
storage units to preprocess read requests. Bandwidth
utilization on a computer network may be optimized by
scheduling data transfers among the clients and storage
units. If one of the storage units fails, its load also is
distributed randomly and nearly uniformly over the remaining
storage units. Procedures for recovering from failure of a
storage unit also may be provided.

The storage units and applications also may operate
independently and without central control. For example,
each client may use only local information to schedule
communication with a storage unit. Storage units and
applications therefore may be added to or removed from the
system. As a result, the system is expandable during
operation.

When the redundancy information is a copy of one
segment, system performance may be improved, although at
the expense of increased storage. For example, when an
application requests a selected segment of data, the request
may be processed by the storage unit with the shortest queue
of requests so that random fluctuations in the load applied by
multiple applications on multiple storage units are balanced
statistically and more equally over all of the storage units.
This combination of techniques results in a system which
can transfer multiple, independent high-bandwidth streams
of data between multiple storage units and multiple
applications in a scalable and reliable manner.
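
The shortest-queue selection described above can be sketched as follows. The segment table reproduces the example placement of FIG. 1A; the queue lengths and the function name are hypothetical values chosen for illustration.

```python
def select_unit(segment_id, segment_table, queue_lengths):
    """Among the units holding a copy of the segment, return the one
    whose queue of pending requests is currently shortest."""
    candidates = segment_table[segment_id]
    return min(candidates, key=lambda unit: queue_lengths[unit])


# Copies of segments 1-4 on units w, x, y, z, as in the FIG. 1A example.
segment_table = {1: ("w", "y"), 2: ("x", "z"), 3: ("w", "x"), 4: ("y", "z")}

# Assumed number of requests waiting at each unit.
queue_lengths = {"w": 5, "x": 2, "y": 1, "z": 7}

choices = {seg: select_unit(seg, segment_table, queue_lengths)
           for seg in segment_table}
```

With these assumed queue lengths, requests are steered away from the heavily loaded units w and z whenever the other copy resides on a less loaded unit.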

Accordingly, in one aspect, a distributed data storage
system includes a plurality of storage units for storing data,
wherein segments of data stored on the storage units are
randomly distributed among the plurality of storage units.
Redundancy information corresponding to each segment
also is randomly distributed among the storage units.

When the redundancy information is a copy of one
segment, each copy of each segment may be stored on a
different one of the storage units. Each copy of each segment
may be assigned to one of the plurality of storage units
according to a probability distribution defined as a function
of relative specifications of the storage units. The distributed
data storage system may include a computer-readable
medium having computer-readable logic stored thereon and
defining a segment table accessible by a computer using an
indication of a segment of data to retrieve indications of the
storage units from the plurality of storage units on which the
copies of the segment are stored. The plurality of storage
units may include first, second and third storage units
connected to a computer network.
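
One way such a probability distribution and segment table might look is sketched below. Reducing the relative specification of each storage unit to a single numeric weight is an assumption of this sketch, as are the names `choose_two_units`, `build_segment_table` and `units_for`.

```python
import random


def choose_two_units(specs, rng):
    """Pick two distinct storage units, each draw weighted by the
    unit's relative specification (a single assumed weight)."""
    names = list(specs)
    first = rng.choices(names, weights=[specs[n] for n in names], k=1)[0]
    rest = [n for n in names if n != first]
    second = rng.choices(rest, weights=[specs[n] for n in rest], k=1)[0]
    return first, second


def build_segment_table(num_segments, specs, rng):
    """Map each segment number to the two units holding its copies."""
    return {seg: choose_two_units(specs, rng)
            for seg in range(1, num_segments + 1)}


def units_for(table, segment_number):
    """Segment-table lookup: which units hold copies of this segment."""
    return table[segment_number]


rng = random.Random(42)  # hypothetical seed
specs = {"w": 1.0, "x": 1.0, "y": 2.0, "z": 2.0}  # y and z assumed twice as capable
table = build_segment_table(1000, specs, rng)
```

The second unit is drawn from the units remaining after the first draw, so the two copies of a segment never land on the same storage unit.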

In another aspect, a file system for a computer enables the
computer to access remote independent storage units over a
computer network in response to a request, from an
application executed on the computer, to read data stored on
the storage units. Segments of the data and redundancy
information are randomly distributed among the plurality of
storage units. Where the redundancy information is a copy
of a segment, the file system is responsive to the request to
read data, to select, for each segment of the selected data,
one of the storage units on which the segment is stored. The
file system may reconstruct a lost segment from other
segments and the redundancy information. Each segment of
the requested data is read from the selected storage unit for
the segment. The data is provided to the application when
the data is received from the selected storage units. In this
file system, the storage unit may be selected such that a load
of requests on the plurality of storage units is substantially
balanced. The storage unit for the segment may be selected
according to an estimate of which storage unit for the
segment has a shortest estimated time for servicing the
request.

More particularly, the file system may request data from
one of the storage units, indicating an estimated time. If the
first storage unit rejects the request, the file system may
request data from another of the storage units, indicating
another estimated time. The file system requests the data
from the first storage unit when the second storage unit
rejects the request. Each storage unit rejects a request for
data when the request cannot be serviced by the storage unit
within the estimated time. The storage unit accepts a request
for data when the request can be serviced by the storage unit
within the estimated time.
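
The accept/reject exchange described above may be sketched as follows. The class `StorageUnit`, its `time_to_service` attribute and the millisecond values are hypothetical; in particular, returning to the first storage unit with no deadline after both rejections is an assumption of this sketch rather than something stated in the text.

```python
class StorageUnit:
    """Hypothetical stand-in for a storage unit's request handler."""

    def __init__(self, name, time_to_service):
        self.name = name
        self.time_to_service = time_to_service  # assumed backlog, in ms

    def accepts(self, estimated_time):
        # Accept only when the request can be serviced within the estimate.
        return self.time_to_service <= estimated_time


def choose_unit(first, second, first_estimate, second_estimate):
    """Return the name of the unit that will service the request."""
    if first.accepts(first_estimate):
        return first.name
    if second.accepts(second_estimate):
        return second.name
    # Both rejected: the request goes back to the first unit, here
    # without a deadline (an assumption of this sketch).
    first.accepts(float("inf"))
    return first.name


busy = StorageUnit("w", time_to_service=80)
idle = StorageUnit("y", time_to_service=10)
selected = choose_unit(busy, idle, first_estimate=20, second_estimate=40)
```

Here unit w cannot meet the 20 ms estimate and rejects, so the request is serviced by unit y.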

The file system may read each segment by scheduling the
transfer of the data from the selected storage unit such that
the storage unit efficiently transfers data. More particularly,
the file system may request transfer of the data from the
selected storage unit, indicating a waiting time. The data
may be requested from another storage unit when the
selected storage unit rejects the request to transfer the data,
or the file system may request the data from the same storage
unit at a later time. Each storage unit rejects a request to
transfer data when the data is not available to be transferred
from the storage unit within the indicated waiting time. The
storage unit transfers the data when the selected storage unit
is able to transfer the data within the indicated waiting time.

In another aspect, a file system for a computer enables the
computer to access remote independent storage units over a
computer network in response to a request, from an
application executed on the computer, to store data on the
storage units. The file system is responsive to the request to
store the data to divide the data into a plurality of segments.
Each segment is randomly distributed among the plurality of
storage units along with redundancy information based on
one or more segments. The file system confirms to the
application whether the data is stored.

In this file system, when the redundancy information is a 
copy of the segment, the random distribution of data may be 
accomplished by selecting, for each segment, at least two of 
the storage units at random and independent of the storage 
units selected for other segments. The selected storage units 
may be requested to store the data for each segment. The file 
system may select a subset of the storage units, and may
select the storage units for storing the segment from
among the storage units in the selected subset. 
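
A minimal sketch of this write path, assuming in-memory dictionaries in place of real storage units, is given below. The function name `store`, the `subset_size` parameter, the segment size and the seed are illustrative assumptions.

```python
import random


def store(data, units, segment_size, rng, subset_size=None):
    """Divide data into segments and write two copies of each segment
    to two distinct units chosen at random, independently per segment.
    If subset_size is given (at least 2), units are first narrowed to
    a random subset."""
    pool = rng.sample(units, subset_size) if subset_size else list(units)
    contents = {unit: {} for unit in units}  # stand-in for unit storage
    table = {}                               # segment number -> (A, B)
    for start in range(0, len(data), segment_size):
        number = start // segment_size + 1
        unit_a, unit_b = rng.sample(pool, 2)
        segment = data[start:start + segment_size]
        contents[unit_a][number] = segment
        contents[unit_b][number] = segment
        table[number] = (unit_a, unit_b)
    return table, contents


rng = random.Random(1998)  # hypothetical seed
units = ["w", "x", "y", "z"]
original = b"0123456789abcdef"
table, contents = store(original, units, 4, rng, subset_size=3)

# Reading back the "A" copy of every segment reassembles the data.
rebuilt = b"".join(contents[table[n][0]][n] for n in sorted(table))
```

Each call to `rng.sample(pool, 2)` is made afresh for every segment, so the units chosen for one segment do not depend on those chosen for any other.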

The functionality of the file system also may be provided 
by another application or through a code library accessible 
through an application programming interface. Accordingly, 
another aspect is the client or the process implemented 
thereby to perform read or write functions, including selec- 
tion of a storage unit and scheduling of network transfer. 
Another aspect is the storage units or the process imple- 
mented thereby to perform read or write functions, including 
selection of a storage unit and scheduling of network trans- 
fer. Another aspect is a distributed computer system imple- 
menting such functionality. These operations may be per- 
formed by a client or a storage unit using only local 
information to enable a system to be readily expandable. 

In another aspect, data is recovered in a distributed data 
storage system having a plurality of storage units for storing 
the data, wherein segments of the data and redundancy 
information stored on the storage units are randomly dis- 
tributed among the plurality of storage units, when failure of 
one of the storage units is detected. To recover the data, 
segments of which copies were stored on the failed storage 
unit are identified. The storage units on which the redun- 
dancy information corresponding to the identified segments 
was stored are identified. The redundancy information is 
used to reconstruct a copy of the identified segments, which 
are then randomly distributed among the plurality of storage 
units. Such data recovery may be used in combination with 
the read and write functionality of a file system or distributed 
storage system described herein. 
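
For the case in which the redundancy information is a copy of each segment, the recovery steps above may be sketched as follows. Only the segment table is updated here; the actual copying of data between units is omitted, and the function name and seed are assumptions.

```python
import random


def recover(failed, table, units, rng):
    """Return a new segment table in which every segment that had a
    copy on the failed unit is re-replicated onto a randomly chosen
    surviving unit other than the one holding its remaining copy."""
    survivors = [u for u in units if u != failed]
    new_table = {}
    for seg, (a, b) in table.items():
        if failed not in (a, b):
            new_table[seg] = (a, b)           # unaffected segment
            continue
        source = b if a == failed else a      # unit with the surviving copy
        targets = [u for u in survivors if u != source]
        new_table[seg] = (source, rng.choice(targets))
    return new_table


# Placement of FIG. 1A; unit w is assumed to fail.
table = {1: ("w", "y"), 2: ("x", "z"), 3: ("w", "x"), 4: ("y", "z")}
rng = random.Random(7)  # hypothetical seed
recovered = recover("w", table, ["w", "x", "y", "z"], rng)
```

Because the replacement unit is chosen at random for each affected segment, the load of the failed unit is spread over the remaining units rather than shifted to a single backup.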

In another aspect, streams of video data are combined to 
produce composited video data which is stored in a distrib- 
uted system comprising a plurality of storage units for 
storing video data, wherein copies of segments of the video 
data stored on the storage units are randomly distributed 
among the plurality of storage units. The streams of video 
data are read from the plurality of storage units. These 
streams of video data are combined to produce the compos- 
ited video data. The composited video data is divided into 
segments. Copies of the segments of the composited video 
data are randomly distributed among the plurality of storage 
units. The reading and storage of data may be performed 
using the techniques described herein. 

BRIEF DESCRIPTION OF THE DRAWINGS 
In the drawings, 

FIG. 1A is a block diagram of an example computer 
system; 

FIG. 1B is a block diagram of another embodiment of the
system of FIG. 1A; 

FIG. 2A illustrates a data structure mapping segments of 
data to storage units 42 in FIG. 1A; 

FIG. 2B illustrates a data structure mapping segments of 
data to storage units 42 in FIG. 1B;

FIG. 3 is a flowchart describing how data may be captured
and distributed among several storage units in one
embodiment;

FIG. 4 is a flowchart describing how storage units may
process requests for storing data in one embodiment;

FIG. 5 is a flowchart describing how fault recovery may
be performed when a storage unit becomes unavailable;

FIG. 6 is a flowchart describing how an additional copy
of data may be made;

FIG. 7 is a flowchart describing how a copy of data may
be deleted;

FIG. 8 is a flowchart describing how a storage unit may
be removed from the system;

FIG. 9 is a flowchart describing how data may be archived
or copied as a backup;

FIG. 10 is a state diagram of a process on a storage unit for
notifying a catalog manager of availability of the storage
unit;

FIG. 11 illustrates a list of storage units which may be
maintained by a catalog manager;

FIG. 12 is a state diagram illustrating how the catalog
manager may monitor a storage unit;

FIG. 13 illustrates a table for tracking equivalency of
media data files;

FIG. 14 illustrates a list structure for representing a
motion video sequence of several clips;

FIG. 15 illustrates a structure of buffer memories for
supporting playback of two streams of motion video data
and four streams of associated audio data at a client;

FIG. 16 is a flowchart describing how a client may
process a multimedia composition into requests for data
from a selected storage unit;

FIG. 17 is a flowchart describing how a client requests a
storage unit to transfer data from primary storage into a
buffer in one embodiment;

FIG. 18 is a flowchart describing how a storage unit
replies to requests from a client in FIG. 17;

FIG. 19 illustrates example disk queues, for prioritizing
requests for disk access to data, and network queues, for
prioritizing requests for network transfers of data;

FIG. 20 is a flowchart describing how a client requests a
storage unit to transfer data over the network in one
embodiment;

FIG. 21 is a flowchart describing how a storage unit
processes requests to transfer data from multiple clients in
one embodiment;

FIG. 22 is a flow chart describing an embodiment of a
network scheduling process performed by a client for
transferring data from the client to a storage unit;

FIG. 23 is a flow chart describing an embodiment of a
network scheduling process performed by a storage unit for
transferring data from a client to the storage unit;

FIG. 24 is a flow chart describing how data may be
captured and distributed among several storage units in
another embodiment; and

FIG. 25 is a flow chart describing how fault recovery may
be performed when a storage unit becomes unavailable in
another embodiment.

DETAILED DESCRIPTION

In the following detailed description, which should be
read in conjunction with the attached drawings, example
embodiments of the invention are set forth. All references
cited herein are hereby expressly incorporated by reference.

Several problems arise in the design of a scalable and
reliable distributed system which supports transfer of data,
particularly multiple, independent streams of
high-bandwidth, time-sensitive data such as motion video and
associated audio and other temporally continuous media,
between multiple applications and multiple storage units. In
such a system, an application, for example, which is used to
author a motion video program, may access randomly
several small portions of several different files which may be
distributed over several storage units. Several applications
may require immediate and simultaneous access to the same
data, and any application should be able to access any piece
of media at any time. In a system which is used for
broadcasting or other time sensitive playback, fault
tolerance also is desirable. Finally, the system should be both
expandable and scalable in a manner which simplifies the
addition of new storage units and new applications even
while the system is in operation. Other desirable
characteristics of such a system include a long mean time to failure,
no single point of failure, the capability of being repaired
rapidly and while operating, tolerance to storage unit failure
without disrupting operation, and the capability of
recovering lost data.

In one embodiment, the system includes multiple
applications connected by a computer network to multiple
separate and independent storage units for storing data. The data
is divided into segments. Redundancy information for each
segment is determined and the segment and its redundancy
information are stored on a different one of the storage units.
The selection of a storage unit for a segment is random or
pseudorandom and may be independent of the storage units
selected for other segments, such as the immediately
preceding segment. The redundancy information and random
distribution of data both increase the ability of the system
to efficiently transfer data in both directions between
applications and storage and improve fault tolerance. The
redundancy information may be a copy of a segment. This
replication of segments allows the system to further control
which storage unit will be accessed by a particular application,
such as by selecting the storage unit with the shortest queue
of requests. As a result, random fluctuations in load are
distributed approximately evenly over all of the storage
units.

Applications also may request data transfer with a storage
unit only when the transfer would be efficient. By scheduling
communication over the network appropriately, network
congestion also may be reduced and network bandwidth
may be used more efficiently. Central control points may be
eliminated by having each client use local information to
schedule communication with a storage unit.

FIG. 1A illustrates an example computer system 40. The
computer system includes a plurality of storage units 42. A
storage unit is a device with a nonvolatile computer-readable
medium, such as a disk, on which data may be stored. The
storage unit also has faster, typically volatile, memory into
which data is read from the medium. Each storage unit also
has its own independent controller which responds to
requests for access, including but not limited to read and
write access, to data stored on the medium. For example, the
storage unit 42 may be a computer which stores data
in a data file in the file system of the server. There may be
an arbitrary number of storage units in the computer system
40.

Applications 44 are systems that request access to the
storage units 42 via requests to the storage units over a
computer network 46. The storage units 42 may deliver data
to or receive data from the applications 44 over the computer
network 46. Applications 44 may include systems which
capture data received from a digital or analog source for
storing the data on the storage units 42. Applications 44 also
may include systems which read data from the storage units,
such as systems for authoring, processing or playback of
multimedia programs. Other applications 44 may perform a
variety of fault recovery tasks. Applications 44 also may be
called "clients." One or more catalog managers 49 also may
be used. A catalog manager is a database, accessible by the
applications 44, that maintains information about the data
available on the storage units 42. This embodiment may be
used to implement a broadcast news system such as shown
in PCT Publication WO97/39411, dated Oct. 23, 1997.

Data to be stored on the storage units 42 is divided into
segments. Redundancy information is created based on one
or more segments. For example, each segment may be
copied. As a result, each segment is stored on at least two of
the storage units 42. Alternatively, the redundancy
information may be created by the exclusive-or of two or more
segments. Each segment is stored on a different one of the
storage units 42 from its redundancy information. The
selection of the storage units on which a segment and its
redundancy information are stored is random or
pseudorandom and may be independent of the storage units on which
other segments of the data are stored. In one embodiment,
two consecutive segments are not stored on the same storage
unit. The probability distribution for selecting a storage unit
for storing a segment and its redundancy information may be
uniform over all of the storage units where the
specifications, such as capacity, bandwidth and latency, of
the storage units are similar. This probability distribution
also may be a function of the specifications of each storage
unit. The random distribution of segments of data and
corresponding redundancy information improves both
scalability and reliability.

An example of the random distribution of copies of
segments of data is shown in FIG. 1A. In FIG. 1A, four
storage units 42, labeled w, x, y and z, store data which is
divided into four segments labeled 1, 2, 3 and 4. An example
random distribution of the segments and their copies is
shown, where: segments 1 and 3 are stored on storage unit
w; segments 3 and 2 are stored on storage unit x; segments
4 and 1 are stored on storage unit y; and segments 2 and 4
are stored on storage unit z.

FIG. 1B illustrates an embodiment where a segment and
its corresponding redundancy information are randomly
distributed among the storage units. In FIG. 1B, four storage
units 42, labeled w, x, y and z, store data which is divided
into four segments labeled 1, 2, 3 and 4. The redundancy
information for a segment may be based on one or more
segments. In this example, two segments are used in what is
called herein a "redundancy set." The exclusive-or of the
segments i,j in the redundancy set is computed, thus
providing redundancy information R_{i,j}. The exclusive-or of the
redundancy information R_{i,j} and segment i produces segment
j. Similarly, the exclusive-or of redundancy information R_{i,j}
and segment j produces segment i. Each segment in a
redundancy set and the redundancy information are stored
on different storage units. An example random distribution
of segments and the redundancy information is shown in
FIG. 1B, where: redundancy information R_{3,4} for segments
3 and 4 is stored on storage unit w; segments 2 and 3 are
stored on storage unit x; segment 1 is stored on storage unit
y; and segment 4 and redundancy information R_{1,2} are stored
on storage unit z. The redundancy information also may be
created using many other techniques known in the art of
fault tolerance.

When the redundancy information is a copy of a segment,
the random distribution of segments may be represented in
and tracked by a segment table 90A, or catalog, such as shown
in FIG. 2A. In particular, for data captured from a given
source or for data from a given file, each segment,
represented by a row 92A, has two copies, called A and B, which
are represented by columns 94A. The columns 94A in the
segment table 90A may be referred to herein as the "A list"
or "B list," respectively. Each list alternatively may be
represented by a seed number for a pseudorandom number
generator that is used to generate the list, or by a list or other
suitable data structure such as a record, linked list, array,
tree, table, etc. When using a pseudorandom number
generator, care should be taken to ensure that the storage
units indicated by the numbers for any given segment in the
A and B lists are not the same. The contents of columns 94A
indicate the storage unit on which a copy of a segment is
stored.

The random distribution of segments and redundancy
information based on two or more segments may be
represented in and tracked by a segment table 90B, or a catalog,
such as shown in FIG. 2B. In particular, for data captured
from a given source or for data from a given file, each
segment, represented by a row 92B, has a copy called A,
represented in column 94B. Column 96B may be used to
indicate where the corresponding redundancy information is
stored. There are several ways to indicate where the
redundancy information is stored. If the redundancy segments are
identified as such in the table, then the order of the segments
in the table may be used to infer which segments correspond
to a given redundancy segment. In this case column 96B
may be omitted. For example, the redundancy information
may be treated as another segment, having its own row 92B
in the segment table 90B. Alternatively, the column 96B
may indicate the last segment in the redundancy set in which
the segment is contained. In this embodiment, row 92B of
the last segment of a redundancy set indicates a storage unit
on which the redundancy information for that redundancy
set is stored. In the implementation shown in FIG. 2B,
column 96B indicates the segments within the redundancy
set for the redundancy information.

Each segment table, or file map, may be stored separately
from other segment tables. Segment tables may be stored
together, as a catalog. Catalogs may be stored on a catalog
manager 49, at individual clients, at a central database, or
may be distributed among several databases or clients.
Separate catalogs could be maintained, for example, for
different types of media programs. For example, a broadcast
news organization may have separate catalogs for sports
news, weather, headline news, etc. The catalogs also may be
stored on the storage units in the same manner as other data.
For example, each client may use a seed for a random
number generator to access the catalog. Such catalogs may
be identified by other clients to access data or to handle
recovery requests, for example, by sending a network
broadcast message to all catalog managers or clients to obtain a
copy of the catalog or of an individual segment table.

In order to access the segments of data, each segment
should have a unique identifier. The copies of the segments
may have the same unique identifier. Redundancy
information based on two or more segments has its own identifier.
The unique identifier for a segment is a combination of a
unique identifier for the source, such as a file, and a segment
number. The unique identifier for the source or file may be
determined, for example, by a system time or other unique
identifier determined when data is captured from the source
or at the time of creation of the file. A file system, as
described below, may access the catalog manager to obtain the segment table for each source or file which lists the segment identifiers and the storage units on which the segments and redundancy information are stored. Each storage unit also may have a separate file system which contains a directory of the segment identifiers and the location on that storage unit where they are stored. Application programs executed by a client may use the identifiers of a source or file, and possibly a range of bytes within the source or file to request data from the file system of the client. The file system of the client then may locate the segment table for the source or file, determine which segments need to be accessed and select a storage unit from which the data should be read for each segment, using the unique segment identifiers.

Referring again to FIGS. 1A and 1B, when an application 44 requests access to a selected segment of data on one of the storage units 42, the storage unit places the request on a queue 48 that is maintained for the storage unit. Applications may make such requests independently of each other or any centralized control, which makes the system more readily scalable. Where the redundancy information is a copy of a segment, the selection of a storage unit to which a request is sent may be controlled such that random fluctuations in the load applied by multiple applications 44 on multiple storage units 42 are balanced statistically and more equally over all of the storage units 42. For example, each request from an application 44 may be processed by the storage unit that has the shortest queue of requests. With any kind of redundancy information, the transfer of data between applications and storage units may be scheduled to reduce network congestion. The requests for data may be performed in two steps: a pre-read request which transfers the data from disk to a buffer on the storage unit, and a network transfer request which transfers data over the network from the buffer to the application. To process these two different requests, the queue 48 may include a disk queue and a network queue.

This combination of randomly distributed segments of data and corresponding redundancy information and the scheduling of data transfer over the network provides a system which can transfer multiple, independent high-bandwidth streams of data in both directions between multiple storage units and multiple applications in a scalable and reliable manner. Using copies of segments as redundancy information, the selection of a storage unit for read access may be based on the relative loads of the storage units, and performance may be improved.

Referring now to FIG. 3, an example process for storing multiple copies of segments of data in a randomly distributed manner over the several storage units will now be described in more detail. An example process using redundancy information based on two or more segments is described below in connection with FIG. 24. The following description is based on the real-time capture of motion video data. The example may be generalized to other forms of data, including, but not limited to other temporally continuous media, such as audio, or discrete media such as still images or text, or even other data such as sensory data.

It is generally well-known how to capture real-time motion video information into a computer data file, such as described in U.S. Pat. Nos. 5,640,601 and 5,577,190. This procedure may be modified to include steps for dividing the captured data into segments, and copying and randomly distributing the copies of the segments among the storage units. First, in step 120, the capturing system creates a segment table 90A (FIG. 2A). An image index, that maps each image to an offset into the stream of data to be captured, also typically is created. The indexed images may correspond to, for example, fields or frames. The index may refer to other sample boundaries, such as a period of time, for other kinds of data, such as audio. The capturing system also obtains a list of available storage units. One way to identify which storage units are available is described in more detail below in connection with FIGS. 10-12.

A segment of the data is created by the capturing system in step 121. The size of the segment may be, for example, one quarter, one half or one megabyte for motion video information. Audio information may be divided into, for example, segments having a size such as one-quarter megabyte. In order to provide alignment, if possible, of the segment size to divisions of storage and transmission, the size of the segment may be related, i.e., an integer multiple of, to an uncompressed or fixed data rate, disk block and track size, memory buffer size, and network packet (e.g., 64K) and/or cell sizes (e.g., 53 bytes for ATM). If the data is uncompressed or is compressed using fixed-rate compression, the segment may be divided at temporal sample boundaries which provides alignment between the image index and the segment table. Generally speaking, the segment size should be driven to be larger in order to reduce system overhead, which is increased by smaller segments. On the other hand, there is an increased probability that a convoy effect could occur if the amount of data to be stored and segment size are such that the data is not distributed over all of the storage units. Additionally, there is an increased latency to complete both disk requests and network requests when the segment sizes are larger.

Next, at least two of the storage units 42 are selected, in step 122, by the capturing system from the list of storage units available for storing the selected segment. Selection of the storage units for the copies of one segment is random or pseudorandom. This selection may be independent of the selection made for a previous or subsequent segment. The set of storage units from which the selection is made also may be a subset of all of the available storage units. The selection of a set of storage units may be random or pseudorandom for each source or file. The size of this subset should be such that each storage unit has at least two different segments of the data in order to minimize the likelihood of occurrence of a convoy effect. More particularly, the data should be at least twice as long (in segments) as the number of storage units in the set. The size of the subset also should be limited to reduce the probability that two or more storage units in the subset fail, i.e., a double fault may occur, at any given time. For example, the probability that two storage units out of five could fail is less than the probability that two storage units out of one hundred could, so the number of storage units over which data is distributed should be limited. However, there is a trade off between performance and subset size. For example, using randomly selected subsets of ten out of one-hundred storage units, when two of the one-hundred storage units fail, then ten percent of the files are adversely affected. Without subsets, one hundred percent of the files typically would be adversely affected.

In the rare likelihood of a double fault, i.e., where two or more storage units fail, a segment of data may be lost. In a standard video stream, the loss of a segment might result in a loss of one or two frames in a minute of program material. The frequency of such a fault for a given source or file is a function of its bandwidth and the number of storage units. In particular, where:

s=size of lost data in megabytes (MB),
n=initial number of storage units,
b=average bandwidth of storage units in MB per second,
MTBF=mean time between failures,
MTTR=mean time to repair or replace,
MTDF=mean time for a double fault failure, and
SMTBF=total system mean time between failures,

$$\mathrm{SMTBF} = \frac{\mathrm{MTBF}}{n} \quad\text{and}\quad \mathrm{MTDF} = \frac{1}{\mathrm{MTTR}} \cdot \frac{\mathrm{MTBF}}{n} \cdot \frac{\mathrm{MTBF}}{n-1}$$

As an example, in a system with 100 storage units, each with a capacity of 50 gigabytes, where MTTR is one hour and MTBF is 1000 hours or six weeks, there likely will be 115 years to double fault failure. If the MTTR is increased to twenty-four hours, then there likely will be 4.8 years to double fault failure.

Referring again to FIG. 3, after two storage units are selected, the current segment then is sent to each of the selected storage units in step 124 for storage. These write requests may be asynchronous rather than sequential. The capture system then may wait for all storage units to acknowledge completion of the storage of the segment in step 126. When data is stored in real time while being captured, the data transfer in step 124 may occur in two steps, similar to read operations discussed in more detail below. In particular, the client first may request a storage unit to prepare a free buffer for storing the data. The storage unit may reply with an estimated time for availability of the buffer. When that estimated time is reached, the capture system may request the storage unit to receive the data. The storage unit then may receive the data in its buffer, then transfer the data in its buffer to its storage medium and send an acknowledgment to the capture system.

If a time out occurs before an acknowledgment is received by the capturing system, the segment may be sent again either to the same storage unit or to a different storage unit. Other errors also may be handled by the capturing system. The operations which ensure successful storage of the data on the selected units may be performed by a separate thread for each copy of the segment.

After the data is successfully stored on the storage units, the segment table 90 is updated by the capturing system in step 127. If capture is complete, as determined in step 128, then the process terminates; otherwise, the process is repeated for the next segment by returning to step 121. The segment table may be maintained, e.g., in main memory, at the capture system as part of the file system. While the capturing system manages the segment table and selection of storage units in this example, other parts of the system could coordinate these activities as well, such as the catalog manager 49. The updated segment table may be sent to, for example, the catalog manager in step 129. Alternatively, the catalog manager may produce the segment table by using accumulated knowledge of system operation, and may send this table to the capture system on request.

FIG. 4 is a flowchart describing in more detail how a storage unit stores a segment of the captured data or redundancy information. The storage unit receives the segment of data from a capturing system in step 140 and stores the data in a buffer at the storage unit. Assuming the storage unit uses data files for storage, the storage unit opens a data file in step 142 and stores the data in the data file in step 144. The catalog manager may specify the location where the segment should be stored. The data may be appended to an existing data file or may be stored in a separate data file. As discussed above, the storage unit or the catalog manager may keep track of segments by using a unique identifier for each segment and by storing a table mapping the segment identifier to its location on the storage unit, in step 145. This table may implement the data file abstraction on the storage unit. When the storage unit actually writes data to its main storage may depend on other read and write requests pending for other applications. The management of these concurrent requests is addressed in more detail below. The file then may be closed in step 146. An acknowledgment may be sent to the capturing system in step 148.

When the process of FIGS. 3 and 4 is complete, the captured data is randomly distributed, with at least two copies for each segment, over several storage units. Multiple applications may request access to this data. The manner in which this access occurs is likely to be random. Accordingly, it should be apparent that any storage unit may receive multiple requests for both reading data from and writing data to files stored on the storage unit from multiple applications. In order to manage the requests, a queue 48 of requests is maintained by each of the storage units 42, as mentioned above. In the following description of an example embodiment, a storage unit maintains two queues: one for the requests for disk access, and another for requests for network transfers. One embodiment of these disk and network queues is described in more detail below in connection with FIG. 19.

When data is requested by an application program executed on a client 44, a storage unit is selected to satisfy the request when each segment of data is stored on at least two storage units. The segment table 90 for the requested data is used for this purpose. The selection of a storage unit may be performed by the application program requesting the data, by a file system of the client executing the application program, through coordination among storage units or by another application such as a catalog manager. The selection may be random or pseudorandom, or based on a least recently used algorithm, or based on the relative lengths of the queues of the storage units. By selecting a storage unit based on the relative lengths of the queues on the available storage units, the load of the multiple applications may be distributed more equally over the set of storage units. Such selection will be described in more detail below in connection with FIGS. 16-18.

More details of a particular embodiment will now be described. For this purpose, the storage unit 42 may be implemented as a server or as an independently controlled disk storage unit, whereas the applications 44 are called clients. Clients may execute application programs that perform various tasks. A suitable computer system to implement either the servers or clients typically includes a main unit that generally includes a processor connected to a memory system via an interconnection mechanism, such as a bus or switch. Both the server and client also have a network interface to connect them to a computer network. The network interface may be redundant to support fault tolerance. The client also may have an output device, such as a display, and an input device, such as a keyboard. Both the input device and the output device may be connected to the processor and memory system via the interconnection mechanism.

It should be understood that one or more output devices may be connected to the client system. Example output devices include a cathode ray tube (CRT) display, liquid crystal displays (LCD), printers, communication devices such as a modem or network interface, and video and audio output. It should also be understood that one or more input devices may be connected to the client system. Example input devices include a keyboard, keypad, trackball, mouse,
pen and tablet, communication devices such as a modem or network interface, video and audio digitizers and scanner. It should be understood the invention is not limited to the particular input or output devices used in combination with the computer system or to those described herein.

The computer system may be a general purpose computer system which is programmable using a high level computer programming language, such as the "C" and "C++" programming languages. The computer system also may be specially programmed, special purpose hardware. In a general purpose computer system, the processor is typically a commercially available processor, of which the series x86 processors such as the Pentium II processor with MMX technology, available from Intel and similar devices available from AMD and Cyrix, the 680X0 series microprocessors available from Motorola, the Alpha series microprocessor available from Digital Equipment Corporation, and the PowerPC processors available from IBM are examples. Many other processors are available. Such a microprocessor may execute a program called an operating system, of which the WindowsNT, Windows 95, UNIX, IRIX, Solaris, DOS, VMS, VxWorks, OS/Warp, Mac OS System 7 and OS8 operating systems are examples. The operating system controls the execution of other computer programs and provides scheduling, debugging, input/output control, compilation, storage assignment, data management and memory management, and communication control and related services. The processor and operating system define a computer platform for which application programs in high-level programming languages are written.

Each server may be implemented using an inexpensive computer with a substantial amount of main memory, e.g., much more than thirty-two megabytes, and disk capacity, e.g., several gigabytes. The disk may be one or more simple disks or redundant arrays of independent disks (RAID) or a combination thereof. For example, the server may be a Pentium or 486 microprocessor-based system, with an operating system such as WindowsNT or a real-time operating system such as VxWorks. The authoring system, capturing system and playback system may be implemented using platforms that currently are used in the art for those kinds of products. For example, the MEDIACOMPOSER authoring system from Avid Technology, Inc., of Tewksbury, Mass., uses a Power Macintosh computer from Apple Computer, Inc., that has a PowerPC microprocessor and a MacOS System 7 operating system. A system based on a Pentium II processor with MMX technology from Intel, with the WindowsNT operating system, also may be used. Example playback systems include the "SPACE" system from Pluto Technologies International Inc., of Boulder, Colo., or the AIRPLAY system from Avid Technology which uses a Macintosh platform. The catalog manager may be implemented using any platform that supports a suitable database system such as the Informix database. Similarly, an asset manager that tracks the kinds of data available in the system may be implemented using such a database.

The memory system in the computer typically includes a computer readable and writeable nonvolatile recording medium, of which a magnetic disk, optical disk, a flash memory and tape are examples. The disk may be removable, such as a floppy disk or CD-ROM, or fixed, such as a hard drive. A disk has a number of tracks in which signals are stored, typically in binary form, i.e., a form interpreted as a sequence of ones and zeros. Such signals may define an application program to be executed by the microprocessor, or information stored on the disk to be processed by the application program. Typically, in operation, the processor causes data to be read from the nonvolatile recording medium into an integrated circuit memory element, which is typically a volatile, random access memory such as a dynamic random access memory (DRAM) or static memory (SRAM). The integrated circuit memory element allows for faster access to the information by the processor than does the disk. The processor generally manipulates the data within the integrated circuit memory and then copies the data to the disk when processing is completed. A variety of mechanisms are known for managing data movement between the disk and the integrated circuit memory element, and the invention is not limited thereto. It should also be understood that the invention is not limited to a particular memory system.

It should be understood the invention is not limited to a particular computer platform, particular processor, or particular high-level programming language. Additionally, the computer system may be a multiprocessor computer system or may include multiple computers connected over a computer network.

As stated above, each storage unit 42, if accessed through a server, and each application 44 may have a file system, typically part of the operating system, which maintains files of data. A file is a named logical construct which is defined and implemented by the file system to map the name and a sequence of logical records of data to locations on physical storage media. While the file system masks the physical locations of data from the application program, a file system generally attempts to store data of one file in contiguous blocks on the physical storage media. A file may specifically support various record types or may leave them undefined to be interpreted or controlled by application programs. A file is referred to by its name or other identifier by application programs and is accessed through the file system using commands defined by the operating system. An operating system provides basic file operations for creating a file, opening a file, writing a file, reading a file and closing a file. These operations may be synchronous or asynchronous, depending on the file system.

As described herein, data of a file or source is stored in segments, of which copies or other form of redundancy information are randomly distributed among multiple storage units.

Generally speaking for most file systems, in order to create a file, the operating system first identifies space in the storage which is controlled by the file system. An entry for the new file is then made in a catalog which includes entries indicating the names of the available files and their locations in the file system. Creation of a file may include allocating certain available space to the file. In one embodiment, a segment table for the file may be created. Opening a file typically returns a handle to the application program which it uses to access the file. Closing a file invalidates the handle. The file system may use the handle to identify the segment table for a file.

In order to write data to a file, an application program issues a command to the operating system which specifies both an indicator of the file, such as a file name, handle or other descriptor, and the information to be written to the file. Generally speaking, given the indicator of the file, an operating system searches the directory to find the location of the file. The data may be written to a known location within the file or at the end of the file. The directory entry may store a pointer, called a write pointer, to the current end of the file. Using this pointer, the physical location of the next available block of storage may be computed and the information may be written to that block. The write pointer




may be updated in the directory to indicate the new end of the file. In one embodiment, the write operation randomly distributes copies of segments of the file among the storage units and updates the segment table for the file. The write operation also may cause a segment and corresponding redundancy information to be stored on different storage units.

In order to read data from a file, an application program issues a command to the operating system specifying the indicator of the file and memory locations assigned to the application where the read data should be placed. Generally speaking, an operating system searches its directory for the associated entry given the indicator of the file. The application program may specify some offset from the beginning of the file to be used, or, in a sequential file system, the directory may provide a pointer to a next block of data to be read. In one embodiment, the selection of a storage unit and the scheduling of data transfer is implemented as part of the read operation of the file system of the client.

The client may use a file system or a special code library with a defined application programming interface (API) to translate requests for portions of a file into requests for segments of data from selected storage units. The storage unit may have its own file system which may be entirely separate from the client file system. All of the segments on a storage unit may be stored, for example, in a single file at the storage unit. Alternatively, the client file system may use the storage units over the network as raw storage, using the catalog manager and segment tables to implement the file abstraction. The segment table for a file also may indicate the locations of each segment on the storage units selected for the segment.

A primary advantage of using a file system is that, for an application program, the file is a logical construct which can be created, opened, written to, read from and closed without any concern for the physical storage medium or location on that medium used by the operating system to store the data. In a network file system, the file system manages requests for data from a specified file from the various storage units, without requiring an application program to know any details about the physical storage where the data is stored or the computer network. If the storage unit has its own independent file system, the client file system also need not know details of the storage mechanism of the storage units. The storage units may use, for example, the file system associated with their own operating system, such as the WindowsNT file system or the file system of a real time operating system such as VxWorks, or a file system that allows asynchronous operations.

The storage units are interconnected with the clients and, optionally, the catalog manager using a computer network. A computer network is a set of communications channels interconnecting a set of computer devices or nodes that can communicate with each other. The nodes may be computers such as the clients, storage units and catalog managers, or communication devices of various kinds, such as switches, routers, gateways and other network devices. The communication channels may use a variety of transmission media including optical fibers, coaxial cable, twisted copper pairs, satellite links, digital microwave radio, etc.

A computer network has a topology which is the geometrical arrangement of the connection of the nodes by the network. Kinds of topologies include point-to-point connection, linear bus, ring connection, star connection, and multiconnected networks. A network may use various combinations of these basic topologies. The topology may vary depending on the physical installation. A non-blocking, switch-based network in which each node, i.e., client or storage unit, is connected directly to the same switch may be used. In some implementations, multiple clients and storage units may be connected on a physical loop or subnetwork which are interconnected into a switching fabric. The system also may be connected using multiple switches.

The network also has a network architecture which defines the protocols, message formats, and other standards to which communication hardware and software conform in order for communication to occur between devices on the network. A commonly used network architecture is the International Standards Organization seven-layer model known as the Open Systems Interconnection reference model. The seven layers are the application, presentation, session, transport, network, link and physical layers. Each machine communicates with any other machine using the same communication protocol at one of these layers.

In one embodiment, the link layer preferably is one that retains the order of packets as they are received at the client in order to avoid the potential for an unlimited latency. Accordingly, suitable link layer protocols include asynchronous transfer mode (ATM) networks, such as OC3, OC12, or higher bandwidth networks. An ATM system operating in the AAL5 mode is preferable. Ethernet networks with 100 Tx to gigabit (1,000 Tx) capacity also may provide efficient packet transmission from the source to the destination. Suitable Ethernet network platforms are available, for example, from 3Com of Santa Clara, Calif. An example ATM system is available from Fore Systems of Warrendale, Pa. or Giga-Net, of Concord, Mass. A FibreChannel, FDDI or HIPPI network also may be used. The different clients, the catalog manager and the storage units all may communicate using the link layer protocol. Communication at this layer also reduces overhead due to memory copies performed to process encapsulated data for each layer's protocol. A bandwidth distributed network file system from Polybus Systems Corporation in Tyngsboro, Mass., may be used.

Having now described computer platforms for one embodiment, some additional operations and details of one embodiment will now be described.

In one embodiment, there are processes for maintaining the storage units and the data stored on the storage units. For example, fault recovery procedures may involve the creation of additional copies of a file. Additionally, files may be deleted or added based on the need for availability of, i.e., reliability of access to, the file. Finally, some maintenance procedures may involve deleting files on a storage unit, copying the files to another storage unit and removing the storage unit from the system. A file also may be archived, or removed from the system to archival storage. These processes will now be described in more detail in connection with FIGS. 5-9. Such data management processes may be performed by the catalog manager, another storage unit, or a client. The performance of these processes by a client would not occupy the resources of the catalog manager or storage units, which may be used for other more important tasks, such as replying to client requests for data.

FIG. 5 is a flowchart describing in more detail how fault recovery may be performed when a storage unit becomes unavailable after its failure is detected. One way to detect such failure is described in more detail below in connection with FIGS. 10-12. Repeated failures to respond to requests also may be used to indicate failures. The success of this process depends on the number of copies of each segment within the system or a number of segments in a redundancy set. Given a number N of copies, then N-1 storage units may fail and the system still will operate without loss of data.
After a storage unit fails, a new storage unit may be installed
in its place, with lost data restored, or the lost data may be 
recreated and distributed over the remaining storage units. 
FIG. 5 describes a process for when the redundancy infor- 
mation is a copy of a segment. FIG. 25, described below, 
illustrates a process for when the redundancy information is 
based on two or more segments. 

Additional copies of data may be made by first selecting 
the data, e.g., a file or source to be recovered, in step 200. 
The file to be recovered may be selected by a priority 
ordering, and may be selected either automatically or manu- 
ally. This kind of recovery allows data from some files to be 
reconstructed and made available before data from other 
files is recovered. The lost segments of the data, i.e., those 
stored on the lost storage unit, are identified in step 202 
using the segment table for the source. A new storage unit for 
each lost segment is selected in step 204, typically in the 
same manner as when data is originally captured, when a 
new storage unit is not available to replace the failed storage
unit. Alternatively, the replacement storage unit is selected. 
A copy of the lost segment is read from an alternate storage 
unit in step 206 and stored in the selected storage unit. The 
file operations for steps 204 through 208 may be asynchro- 
nous and performed by separate threads for each segment. 
Such operation takes advantage of the many-to-many read/ 
write capability provided in this network architecture. The 
segment table for the file then is updated upon the successful 
completion of the copy operation in step 208. When the 
process is complete, the catalog manager may be updated 
with the new segment table in step 209, if a catalog manager 
maintains the segment tables. If the original segment table 
was represented by a seed to a pseudorandom sequence 
generator, the actual table may need to be created and 
modified. 
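
The recovery steps described above for copied segments may be sketched as follows. This is only an illustrative sketch under stated assumptions: the segment table is modeled as a list of sets of storage unit names, and the function name and the copy_segment callback are hypothetical, not part of the described system.

```python
import random


def recover_lost_segments(segment_table, failed_unit, available_units, copy_segment):
    """Re-create the copies lost on failed_unit (steps 202-208, simplified).

    segment_table: list of sets; entry i names the units holding segment i.
    copy_segment(i, src, dst): hypothetical callback that performs the copy.
    """
    for index, holders in enumerate(segment_table):
        if failed_unit not in holders:
            continue  # step 202: this segment was not lost
        survivors = holders - {failed_unit}
        if not survivors:
            raise RuntimeError(f"segment {index} has no surviving copy")
        # Step 204: choose a unit that does not already hold this segment.
        candidates = [u for u in available_units
                      if u != failed_unit and u not in survivors]
        if not candidates:
            raise RuntimeError(f"no storage unit free for segment {index}")
        target = random.choice(candidates)
        source = next(iter(survivors))  # step 206: read an alternate copy
        copy_segment(index, source, target)
        # Step 208: update the table only after the copy has succeeded.
        segment_table[index] = survivors | {target}
    return segment_table
```

Each loop iteration is independent of the others, which is why the per-segment work could be handed to separate threads as the text notes.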

The speed of repopulation and redundancy restoration for 
an unloaded system using this process is defined by the 
following equation: 



$$\frac{s}{(n-1+d)(b/2)}$$



where: 

s=size of lost files in megabytes (MB),

n= initial number of storage units, 

b=average bandwidth of storage units, expressed in 
MB/second, and 

d=user demand load, expressed in MB/second. 
For example, if access to 50 gigabytes of storage is lost 
because one of ten storage units fails, then with n=10 storage
units, with unit bandwidth b=10 MB/sec, then (n-1)=9 and
(b/2)=5. Thus, recovery would take approximately 20 min- 
utes with no other loads. This absolute recovery speed 
generally is reduced as a reciprocal of the varying playback 
load to clients, e.g., a 50% load results in 200% increase in 
repopulation time. When invoked, the redistribution task can 
run at a fast rate with multiple storage unit checkerboard 
switched to multiple storage units, but repopulation activi- 
ties operate opportunistically, subordinated to client file 
service requests. The net effect is only a slight loss of total 
bandwidth of the storage units due to the failed storage unit. 
Prioritization of the file selection for recovery ensures that 
the most important files are recovered most quickly. 
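
The arithmetic of the example above may be checked with a short calculation. This sketch covers only the case with no client load, assumes that 50 gigabytes is taken as 50,000 MB, and uses a hypothetical function name.

```python
def unloaded_recovery_seconds(s_mb, n, b_mb_per_s):
    """Recovery time with no client load: s / ((n - 1) * (b / 2))."""
    return s_mb / ((n - 1) * (b_mb_per_s / 2))


seconds = unloaded_recovery_seconds(50_000, 10, 10)  # 50 GB taken as 50,000 MB
minutes = seconds / 60
print(f"{seconds:.0f} s = {minutes:.1f} min")  # prints "1111 s = 18.5 min"
```

The result of roughly 18.5 minutes agrees with the approximately 20 minutes stated in the example.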

FIG. 6 is a flowchart describing in more detail how an 
additional copy of data may be made. This process may be 
invoked to make additional data copies available of mission 
critical or high-demand data. A date-stamp may be given to
the new copy to indicate when the copy may be deleted.
Given selected data, a segment of the data is selected in step 
210. Each segment is assigned randomly a new storage unit 
in step 212, ensuring that each storage unit has at most one 
copy of a given segment. Next, the segment is stored on the 
selected storage unit in step 214. Upon successful completion
of the storage of that segment, the segment table for the
data is updated in step 216. If all of the segments of the data 
have not yet been copied, as determined in step 217, the 
process repeats by returning to step 210 to select the next 
segment of the data. When the process is complete, the 
catalog manager may be updated with the new segment table 
in step 218, if the catalog manager maintains the segment 
tables. Although this process is sequential over the 
segments, each segment may be processed using a separate 
thread, and the file operation of step 214 may be asynchro- 
nous. Such processing enables the copy to be made quickly. 
With this procedure, the segment table still may be repre- 
sented using the seed for the pseudorandom number gen- 
erator. 
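
The random assignment with the constraint that no storage unit holds more than one copy of a given segment may be sketched as follows. The list-of-lists segment table and the function name are assumptions made for illustration; the actual storing of the segment in step 214 is omitted.

```python
import random


def add_copy(segment_table, units, rng=random):
    """Append one more copy of every segment (steps 210-216, simplified).

    segment_table: list of lists; entry i names the units holding segment i.
    """
    for holders in segment_table:  # step 210: select the next segment
        # Step 212: only units that do not yet hold this segment qualify.
        free = [u for u in units if u not in holders]
        if not free:
            raise ValueError("every storage unit already holds this segment")
        holders.append(rng.choice(free))  # step 216: record the new copy
    return segment_table
```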

FIG. 7 is a flowchart describing in more detail how a copy 
of data is deleted. This process may be invoked, for example, 
when data is no longer in high demand. For example, a date 
stamp on a copy may be used to indicate when the data 
should be deleted. Given the segment table shown in FIG. 2 
for given data, one of the sets of copies, i.e., a column in the 
table, is selected in step 220. Each segment in the column is 
deleted in step 222. Upon successful completion of the 
delete operation in step 222 for each segment, the segment 
table is updated in step 224. Steps 222 and 224 are repeated 
for each segment. This process may be sequential over the
segments or each segment may be processed by a separate 
thread. When the process is complete, the catalog manager 
may be updated with the new segment table in step 226, if 
the catalog manager maintains the segment tables.

FIG. 8 is a flowchart describing how an otherwise active 
storage unit may be removed from the system. The data 
available on the storage unit is identified, for example by 
identifying a list of its files using its file system. First, the 
storage unit is made unavailable for writing new segments.
This step may be accomplished, for example, by notifying
the catalog manager or by sending a broadcast message to all 
clients. The segments of each file are redistributed on the 
other storage units before the storage unit is removed from 
the system. Given this list of files, the next file to be 
processed is selected in step 230. Using the segment table, 
all segments of this file on the storage unit, including 
segments containing redundancy information, are identified 
in step 232. The next segment to be processed is selected in 
step 234. The selected segment is assigned a new storage 
unit in step 235 by a random selection from the remaining 
storage units, assuring that no storage unit has more than one 
copy of a given segment. The data is then written to the 
newly selected storage unit in step 236. Upon successful 
completion of that write operation, the segment table is 
updated. When all the segments for a given file are 
redistributed, as determined in step 238, the segment table 
may be sent to the catalog manager if appropriate in step 
239. The segments may be processed sequentially or by 
separate threads using asynchronous file operations. The 
segments may be deleted from the old storage unit after the 
catalog manager is updated. Processing continues with the 
next file, if any, as determined in step 240. If all files have 
been redistributed, this process is complete and the storage 
unit may be removed from the system.

FIG. 9 is a flowchart describing how data may be archived 
or copied for backup. This process involves copying of one 
copy of each segment of the data from the available storage
units into a backup storage system, such as an archival
storage medium. Each copy set and any redundancy
information also may be deleted from all storage units. This
process may be performed by selecting a copy set, e.g., the
A list, from a column of the segment table in step 250.
Alternatively, each segment may be read in order and the
selection of a storage unit for each segment may be
performed using techniques applied by other applications as
described above. Each segment from the selected copy set is
read from its storage unit and is stored on a storage medium
in step 252. Upon successful copying of each segment to the
storage medium, all of the remaining segments from all the
remaining copy sets or any redundancy information may be
deleted from the storage units in step 254. The segments may
be processed sequentially or by separate threads using
asynchronous file operations. The catalog manager then may
be updated in step 256.

How the storage units may be monitored to determine
availability and to detect failures will now be described in
connection with FIGS. 10 through 12. There are several
ways to determine whether storage units are available,
including polling the storage units, handling exceptions
from the storage units, or by the storage units periodically
informing an application or applications of their availability.
In one embodiment, the catalog manager 49 or some other
client may both monitor which storage units 42 are active in
the system and maintain a catalog of segment tables for each
file. One method for monitoring the storage units is shown in
FIGS. 10-12. Each storage unit available on the system
establishes a process which periodically informs the catalog
manager that it is available. In particular, this process may
be considered a state machine having a first state 60 in which
the storage unit periodically increments a counter, for
example, in response to a timer interrupt or event from a
system timer. When this counter reaches a certain
predetermined amount, such as a hundred milliseconds, a
transition to another state 62 occurs. In the transition to
state 62, a signal, called a "ping," is sent to the catalog
manager by the storage unit. This signal may be a small
message, even one ATM cell, that does not use much
bandwidth to transmit. This signal may include an identifier
of the storage unit, and possibly other information such as
the capacity, efficiency and/or bandwidth availability of the
storage unit. At the next timer interrupt or event, the counter
is reset and a transition back to state 60 occurs.

The catalog manager may keep track of the available
storage units. For this purpose, the catalog manager may use
a list 70 of storage units, an example of which is shown in
FIG. 11. This list of storage units may be implemented as a
table indexed by the identifiers of the storage units as
indicated at 72. If the storage unit is present or available, the
bandwidth, memory capacity or other information about the
power of the storage unit is made available in column 74.
The count since the last "ping" from the storage unit also is
present as indicated in column 76. If this count exceeds a
predetermined amount, such as three hundred milliseconds,
the storage unit is considered out of service and fault
recovery procedures, such as described above, may be
followed. An example tracking process which maintains the
list 70 of storage units will now be described in more detail
in connection with FIG. 12.

FIG. 12 is a state machine describing a tracking process
which may be performed by the catalog manager to
determine which storage units are available. One of these state
machines may be established for each storage unit as a
process on the catalog manager. The first state 80 is a waiting
state in which the count value 76 for the storage unit in the
list 70 of storage units is incremented for the storage unit in
response to periodic timer interrupts. When a "ping" is
received from the storage unit, the transition occurs to state
82. In state 82, the presence of this storage unit in list 70 is
verified. If the storage unit is in the list 70, the count 76 for
the storage unit is reset, other information about the storage
unit may be updated, and a transition back to state 80 occurs.
If the storage unit is not in the list, it is added to the list with
a reset count and a transition back to state 80 occurs. After
a given increment, if the count for the storage unit is greater
than a predetermined time out value, such as three hundred
milliseconds, fault recovery procedures are performed. In
particular, the storage unit is removed from list 70 and fault
tolerant procedures are performed in state 84. If a "ping"
from a storage unit is received by the catalog manager and
if that storage unit does not have a corresponding tracking
process, then the catalog manager adds the storage unit to
the list and creates a tracking process for the storage unit.

In addition to having a catalog manager 49, the system
also may include a database, called an asset manager, which
stores a variety of data about the media sources available in
the system such as an index for each file. The catalog
manager and asset manager also may be combined. One
useful kind of information for storing in the asset manager
is a table, shown in FIG. 13, that relates equivalent data files
based on a source identifier and a range within that source,
such as shown in U.S. Pat. No. 5,267,351. The source
identifier is an indication of the original source of data,
which may be an analog source, whereas the data actually
available is a digitized copy of that source stored on the
storage units. In particular, the table has an entry for a source
identifier 100, a range within the source identifier 102, and
an indication 104, such as a list of data files, of equivalent
data from that source. The list 104 may be used to identify one
of the data files for a source, and in turn access the segment
table for that file to determine where segments of the data are
distributed on the various storage units. The segment table
90A of FIG. 2A may be incorporated into this list 104 of
FIG. 13 as shown at 106 and 108. The segment table 90B of
FIG. 2B similarly may be incorporated into list 104. Such
equivalency among data also may be maintained by any
application program.

Since the catalog manager is a database that monitors how
data is distributed on the various storage units, it also should
be designed to enhance fault tolerance and availability and
to reduce its likelihood of being a bottleneck. Accordingly,
the catalog manager should be implemented using
conventional distributed database management techniques. Also,
highly available machines, such as those from Marathon
Technologies, Tandem Computers, Stratus, and Texas
Micro, Inc., may be used to implement the catalog manager.
There also may be several catalog managers that are used by
separate client applications. Alternatively, each client
application may maintain its own copy of catalogs locally, using
standard techniques to maintain consistency between
multiple copies of the data. In this manner, a catalog manager is
not a central point of failure. A client also may act as its own
catalog manager. The catalogs also may be treated as data of
which its segments and redundancy information are
randomly distributed among the storage units. Each client may
have a segment table, or a random number generator seed
representing the segment table, for each catalog.

Having now described how data may be captured and
stored onto storage units, and how the storage of data on the
storage units may be managed, client applications that
perform authoring and playback will now be described in
more detail in connection with FIGS. 14 and 15.
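
As a rough illustration of the ping-based monitoring of FIGS. 10 through 12 described above, the tracking process on the catalog manager may be sketched as follows. The class and method names are hypothetical, and the sketch assumes that each timer tick represents one hundred milliseconds, so that a time out of three ticks corresponds to the three hundred millisecond example.

```python
class StorageUnitTracker:
    """Catalog-manager side of the ping protocol (FIG. 12, simplified)."""

    def __init__(self, timeout_ticks=3):
        self.timeout_ticks = timeout_ticks
        self.counts = {}  # unit id -> ticks since last ping (column 76)

    def on_ping(self, unit_id):
        # State 82: reset the count, adding the unit if it is not yet listed.
        self.counts[unit_id] = 0

    def on_timer_tick(self):
        # State 80: increment every count and report units that timed out.
        failed = []
        for unit_id in list(self.counts):
            self.counts[unit_id] += 1
            if self.counts[unit_id] > self.timeout_ticks:
                del self.counts[unit_id]  # state 84: remove from the list
                failed.append(unit_id)
        return failed
```

The list returned by on_timer_tick identifies the storage units for which fault recovery would be started.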
There are several kinds of systems that may be used to
author, process and display multimedia data. These systems
may be used to modify the data, define different
combinations of data, create new data and display data to a user. A
variety of techniques are known in the art for implementing
these kinds of systems.

Multimedia authoring, processing and playback systems
typically have a data structure which represents the
multimedia composition. The data structure ultimately refers to
clips of source material, such as digitized video or audio,
using an identifier of the source material, such as a unique
identifier or a file name, and possibly a temporal range
within the source material defining the clip. The identifier
may be of a type that may be used with a list of equivalent
data files to identify a file name for the source material. An
index may be used to translate the temporal range in the
source into a range of bytes within a corresponding file. This
range of bytes may be used with the segment table for the file
to identify segments of data that are needed and the storage
units from which the data is retrieved.

FIG. 14 shows an example list structure that may be used
to represent part of a multimedia composition. In an example
shown in FIG. 14, there are several clips 260, each of which
includes a reference to a source identifier, indicated at 262,
and a range within the source, as indicated at 264. Generally,
there may be such a list for each track of media in a temporal
composition. There are a variety of data structures which
may be used to represent a composition. In addition to a list
structure, a more complex structure is shown in PCT
Published Application WO93/21636 published on Oct. 28, 1993.
Other example representations of multimedia compositions
include those defined by Open Media Framework
Interchange Specification from Avid Technology, Inc., Advanced
Authoring Format (AAF) from the Multimedia Task Force,
QuickTime from Apple Computer, DirectShow from
Microsoft, and Bento also from Apple Computer, and as
shown in PCT Publication WO96/26600.

The data structure described above and used to represent
multimedia programs may use multiple types of data that are
synchronized and displayed. The most common example is
a television program or film production which includes
motion video (often two or more streams or tracks) with
associated audio (often four or more streams or tracks). As
shown in FIG. 15, the client computer may have a
corresponding set 290 of memory buffers 294 allocated in the
main memory. Each buffer may be implemented as a
"serializing" buffer. In other words, the client inserts data
received from a storage unit into these independently
accessible portions and reads from the set of buffers sequentially.
Since requests may be sent to several storage units and data
may be received at different times for the same stream, the
buffers may not be filled in sequence when written, but are
read out in sequence to be displayed. In FIG. 15, the filled
in buffers indicate the presence of data in the buffer. Any
empty buffer may be filled at any time as indicated at 293
and 295. However, each set of buffers has a current read
location 291 from which data is read and which advances as
time progresses as indicated in 297. A subset 292, 296 of these
buffers may be allocated to each stream of data.

Each buffer in the set of buffers has a size that corresponds
to a fixed number of segments of data, where the segment
size is the size of file segments stored on the storage units.
There may be several, e.g., four, audio buffers per stream
292 of audio data, where each buffer may contain several,
e.g., four, segments. Similarly, each video stream 296 may
have several, e.g., four, buffers each of which contains
several, e.g., four, segments. Each of the buffers may be
divided into independently accessible portions 298 that
correspond in size to the size of data packets for which
transfer is scheduled over the network.

Because the video and audio data may be stored in
different data files and may be combined arbitrarily, better
performance may be obtained if requests for data for these
different streams on the client side are managed efficiently.
For example, the client application may identify a stream for
which data can be read, and then may determine an amount
of data which should be read, if any. A process for
performing this kind of management of read operations is shown in
U.S. Pat. No. 5,045,940. In general, the client determines
which stream has the least amount of data available for
display. If there is a sufficient amount of buffer space in the
set of buffers for that stream to efficiently read an amount of
data, then that data is requested. It is generally efficient to
read data when the available space in memory for the
selected stream is large enough to hold one network
transmission unit of data. When it is determined that data for a
stream should be requested, each segment of the data is
requested from a storage unit selected from those on which
the segment is stored.

A general overview of a process by which a composition
may be converted into requests for data in order to display
the data will now be described in connection with FIG. 16.
In order to know what files to request from the storage unit,
an application program executed on the client system may
convert a data structure representing a composition, such as
shown in FIG. 14, into file names and ranges within those
files in step 270 in FIG. 16. For example, for each source
identifier and range within that source, a request may be sent
to the asset manager. In response, the asset manager may
return a file name for a file containing equivalent media
corresponding to the received source identifier and range.
The segment table for the file and the list of available storage
units also may be provided by the catalog manager.

When the client requests a segment of data for a particular
data stream, the client selects a storage unit, in step 272, for
the segment that is requested. This selection, in one
embodiment where the redundancy is provided by copying each
segment, will be described in more detail below in
connection with FIGS. 17 and 18. In general, the storage unit with
the shortest queue 48 (FIG. 1) may be selected. The client
then reads the data from the selected storage unit for the
segment, in steps 274 through 278. Step 274 may be
understood as a pre-read step in which the client sends a
request to a storage unit to read desired data from
nonvolatile storage into faster, typically volatile storage. The request
to the storage unit may include an indication of how much
time is required from the time the request is made until that
requested data must be received at the client, i.e., a due time.
After a pre-read request is accepted, the client waits in step
276. The request is placed in the storage unit's queue 48, and
the due time may be used to prioritize requests as described
below. Data is transferred from the storage unit in step 278
after data becomes available in a buffer at the storage unit.
This step may involve scheduling of the network usage to
transfer the data to maximize efficiency of network
utilization. The received data is stored in the appropriate buffer at
the client, and ultimately is processed and displayed in step
280. If the segment is lost at the storage unit, the redundancy
information may be used to reconstruct the segment.

There are several ways to initiate the pre-read requests,
including selection of a storage unit, in step 274 and the data
transfer in step 278. For example, the MediaComposer
authoring system from Avid Technology, Inc., of Tewksbury,
Mass., allows a user to set either a number of clips or an
amount of time as a look-ahead value, indicating how far
ahead in a composition the application should initiate read 
requests for data. A program schedule for a television 
broadcast facility also may be used for this purpose. Such 
information may be used to initiate selection of a storage
unit and pre-read requests. Such pre-reads may be performed 
even if buffer space is not available in buffers 290 (FIG. 15), 
as is shown in European patent application 0674414 A2, 
published Sep. 9, 1995. The amount of available space in the 
buffers 290 (FIG. 15) may be used to initiate data transfers
in step 278 (FIG. 16), or to initiate both pre-reads (step 274) 
and data transfers (step 278). 

One process which enables a client to make an adequate
estimate of which storage unit has the shortest queue of
requests, without requiring an exhaustive search of all the
available storage units, will now be described in connection
with FIGS. 17 and 18. First, the client sends a request with
a threshold E1 to a first storage unit in step 330. The
threshold E1 is a value indicating an estimate of time by
which the request should be serviced. This estimate may be
expressed as a time value, a number of requests in the disk
queue of the storage unit, such as four, or other measure. The
meaning of this threshold is that the request should be
accepted by the storage unit if the storage unit can service
the request within the specified time limit, for example. The
client receives a reply from the storage unit in step 332. The
reply indicates whether the request was accepted and placed
in the disk queue of the storage unit or whether the request
was rejected as determined in step 334. If the request is
accepted, the client is given an estimate of time at which the
data will be available in a buffer at the storage unit in step
336. For example, if the data for the requested segment
already is in a buffer, the storage unit indicates that the data
is immediately available. The client then may wait until it is
time to request transfer of the data (step 278 in FIG. 16)
some time after the estimated time has passed. If the request
is rejected, an estimate of the amount of time the storage unit
actually is likely to take, such as the actual size in number
of entries of the disk queue, is returned from the storage unit.
This actual estimate is added to a value K to obtain a
threshold E2 in step 340. The value K may be two, if
representing a number of disk queue entries. Threshold E1
and value K may be user-definable. A request is sent to a
second storage unit in step 342 indicating the threshold E2.
The client then receives a reply in step 344, similar to the
reply received in step 332. If this reply indicates that the
request was accepted, as determined in step 346, the client has
an estimate of time at which the data will be available at the
second storage unit, as indicated in step 336 after which the
client may wait to schedule the data transfer. Otherwise, an
unconditional request, one with a large threshold, is sent to
the first storage unit in step 348. An acknowledgment then
is received in step 350 indicating the estimate of time at
which the data will be available in a buffer at the storage
unit, as indicated at step 336.

The storage unit, on the other hand, does not know 
whether it is the first or second storage unit selected by the 
client when it receives a request. Rather, the storage unit 
simply receives requests as indicated in step 360. The 
threshold indicated in the request is compared to the storage
unit's own estimate of the time the client will need to wait 
in step 362, for example by comparing the size of the disk 
queue of the storage unit to the specified threshold. If the 
threshold in the request is greater than the estimate made by
the storage unit, the request is placed in the disk queue and an
estimate of the time when the data will be available in a 
buffer at the storage unit is determined in step 364. This 
estimate may be determined, for example, based on disk
access speed, disk queue length and possibly a running 
average of recent performance. An acknowledgment is sent 
to the client in step 336 including the estimated time of 
availability of the data in the buffer at the storage unit. 
Otherwise, a rejection is sent in step 368 indicating this 
estimate, such as the actual size of the disk queue. 
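
The exchange between the client and the storage units described above may be sketched as follows. This is a simplified sketch that assumes the thresholds and estimates are expressed as numbers of disk queue entries; the class and function names are hypothetical, and the default values of four for the first threshold and two for K are taken from the examples in the text.

```python
class StorageUnit:
    """Storage unit side: accept or reject a request against a threshold."""

    def __init__(self, name, queue_len):
        self.name = name
        self.queue = queue_len  # current number of disk queue entries

    def request(self, threshold):
        # Step 362: compare the threshold with the unit's own estimate.
        if threshold > self.queue:
            self.queue += 1
            return True, self.queue   # accepted, with an availability estimate
        return False, self.queue      # rejected, with the actual queue size


def choose_unit(first, second, e1=4, k=2):
    """Client side: try the first unit, then the second, then fall back."""
    accepted, estimate = first.request(e1)              # steps 330-334
    if accepted:
        return first.name, estimate
    accepted, estimate2 = second.request(estimate + k)  # steps 340-346
    if accepted:
        return second.name, estimate2
    # Step 348: unconditional request, modeled as an infinite threshold.
    _, estimate = first.request(float("inf"))
    return first.name, estimate
```

At most three messages are sent per segment, so the client never has to query every storage unit to find a reasonably short queue.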

The storage unit may keep track of which segments are in 
which buffers on the storage unit. Segment data may be read 
from the storage medium into any free buffer or into a buffer 
occupied by the least recently used segment. In this manner, 
data for a segment may be immediately available in a buffer 
if that segment is requested a second time. 
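
The buffer reuse policy just described may be sketched with a least-recently-used table. The class name and the read_from_disk callback are hypothetical and stand in for the actual access to the storage medium.

```python
from collections import OrderedDict


class SegmentBufferCache:
    """Keeps recently read segments in a fixed number of buffers."""

    def __init__(self, num_buffers, read_from_disk):
        self.num_buffers = num_buffers
        self.read_from_disk = read_from_disk  # hypothetical callback
        self.buffers = OrderedDict()          # segment id -> data, oldest first

    def get(self, segment_id):
        if segment_id in self.buffers:
            self.buffers.move_to_end(segment_id)  # mark as most recently used
            return self.buffers[segment_id]
        if len(self.buffers) >= self.num_buffers:
            self.buffers.popitem(last=False)      # evict least recently used
        data = self.read_from_disk(segment_id)
        self.buffers[segment_id] = data
        return data
```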

As an alternative, a client may use another method to 
select a storage unit from which data will be retrieved, as 
discussed below. After sending the request, the client may 
receive an acknowledgment from the storage unit indicating 
that the request is in the disk queue at the storage unit. 
Instead of receiving an estimate of time at which the data 
will be available in a buffer at the storage unit, the client may 
wait until a ready signal is received indicating that the 
storage unit has read the requested data into a specified 
buffer memory at the storage unit. During this waiting 
period, the client may be performing other tasks, such as 
issuing requests for other data segments, displaying data or 
processing data. One problem with this alternative is that the 
client accepts an unsolicited message, i.e., the ready signal 
from the storage unit, in response to which the client 
changes context and processes the message. The client could 
be busy performing other operations. Although this process 
does provide a more accurate estimate of the time at which 
data is available in a buffer at the storage unit, the ability to 
change contexts and to process incoming messages quickly 
involves more complexity at the client. 

There are several other ways a storage unit may be 
selected from the segment table for a file when the segment 
table tracks copies of each segment. For example, when a 
client is making a file read request, the client may pick 
randomly from either the "A" list or "B" list for the file in 
question. Alternatively, the client may review all of its 
currently outstanding requests, i.e., requests sent but not yet 
fulfilled, and pick which storage unit out of the storage units 
on the A and B lists for the segment currently has the fewest 
outstanding requests. This selection method may reduce the 
chance of a client competing with its own outstanding 
requests, and tends to spread requests more evenly over all 
the storage units. Alternatively, rather than examining out- 
standing requests, a client may examine a history of its 
recent requests, e.g., the last "n" requests, and for the next 
request pick whichever storage unit from the A list and B list 
for the segment has been used less historically. This selec- 
tion method tends to spread requests more evenly over all 
the storage units, and tends to avoid a concentration of 
requests at a particular storage unit. The client also may 
request from each storage unit a measure of the length of its 
disk queue. The client may issue the request to the storage 
unit with the shortest disk queue. As another possibility, the 
client may send requests to two storage units and ultimately 
receive the data from only one. Using this method on a local 
area network, the client may cancel the unused request. On 
a wide area network, the storage unit that is ultimately 
selected may cancel the unused request at the other storage 
unit. 
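
One of the alternatives above, selecting the storage unit with the fewest outstanding requests from this client, may be sketched as follows. The function name and the use of a counter keyed by storage unit name are assumptions made for illustration.

```python
from collections import Counter


def pick_unit(copies, outstanding):
    """Return the unit in copies with the fewest outstanding requests.

    copies: units holding the segment, e.g. its A list and B list entries.
    outstanding: Counter of requests sent but not yet fulfilled, per unit.
    """
    return min(copies, key=lambda unit: outstanding[unit])


outstanding = Counter({"u1": 3, "u2": 1})
choice = pick_unit(["u1", "u2"], outstanding)
outstanding[choice] += 1  # the new request is now outstanding as well
```

The same function could be applied to a history of recent requests by passing a counter built from the last n requests instead.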

A storage unit will likely receive multiple requests from 
multiple applications. In order to manage the requests from 
multiple applications to ensure that the most critical requests 
are handled first, a queue 48 (FIG. 1) is maintained for each 
storage unit. The queue may be maintained in several parts,
depending on the complexity of the system. In particular, the 
storage unit may maintain different queues for disk access 
and for network transfers. The queue may segregate requests 
from time-sensitive applications using data having specific
due times, e.g., for playback to broadcast, from requests 
from other applications, such as capture systems, authoring 
tools or service and maintenance applications. Storage 
requests may be separated further from requests from 
authoring tools and requests from service and maintenance 10 
programs. Requests from authoring tools may be separated 
further from service and maintenance requests. 

FIG. 19 illustrates one embodiment of queue 48, utilizing 
a disk queue 300 and a network queue 320. The disk queue 
has four subqueues 302, 304, 306 and 308, one for each of
the playback, capture, authoring and service and maintenance
client programs, respectively. Similarly, the network
queue 320 has four subqueues 322, 324, 326 and 328. Each
queue includes one or more entries 310, each of which
comprises a request field 312 indicating the client making
the request and the requested operation, a priority field 314
indicating the priority of the request, and a buffer field 316
indicating the buffer associated with the request. The
indication of the priority of the request may be a deadline, a time
stamp, an indication of an amount of memory available at
the client, or an indication of an amount of data currently 
available at the client. A priority scheduling mechanism at 
the storage unit would dictate the kind of priority stamp to 
be used. 

The priority value may be generated in many ways. The
priority value for an authoring or playback system is generally
a measure of time by which the application must
receive the requested data. For example, for a read
operation, the application may report how much data (in
milliseconds or frames or bytes) it has available to play
before it runs out of data. The priority indication for a
capture system is generally a measure of time by which the
client must transfer the data out of its buffers to the storage
unit. For example, for a write operation, the application may
report how much empty buffer space (in milliseconds,
frames or bytes) it has available to fill before the buffer
overflows. Using milliseconds as a unit of measure, the
system may have an absolute time clock that could be used
as the basis for ordering requests in the queue 48, and all
applications and storage units may be synchronized to the
absolute time clock. If such synchronization is not practical,
the application may use a relative time that indicates how
much time may pass, from the time the request is made,
until the requested data should
be received by the client. Assuming low communication
latency, the storage unit may convert this relative time to an 
absolute time that is consistent with the storage unit. 
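
As a small worked illustration of the relative-to-absolute conversion just described, the following Python sketch computes a relative deadline from the amount of buffered playback data and converts it at the storage unit. The function names, the frame rate, and the clock values are hypothetical, and the conversion assumes negligible communication latency as stated above.

```python
def relative_deadline_ms(buffered_frames, frame_rate):
    """Time in milliseconds the client can keep playing from data it already holds."""
    return buffered_frames * 1000.0 / frame_rate


def to_absolute_deadline(storage_clock_ms, relative_ms):
    """Storage-unit side: add the relative time to the local clock (latency ignored)."""
    return storage_clock_ms + relative_ms


# A client holding 15 frames at 30 frames per second has 500 ms of data left.
rel = relative_deadline_ms(buffered_frames=15, frame_rate=30)
abs_deadline = to_absolute_deadline(storage_clock_ms=10_000, relative_ms=rel)
```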

The storage unit processes the requests in its disk queues 
302-308 in their priority order, i.e., operating on the requests
in the highest priority queue first, in order by their priority
value, then the requests in successively lower priority
queues. For each request, the storage unit transfers data
between the disk and the buffer indicated by the request. For
a read request, after the request is processed, the request is
transferred from the disk queue to the network queue. For a
write request, the request is removed from the disk queue 
after the write operation completes successfully. 
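
The processing order of the disk subqueues might be sketched as below. This is an illustrative Python model only: the DiskQueue class, the use of a binary heap per subqueue, and the string request payloads are assumptions, and the priority value is taken to be a deadline in which a smaller number is more urgent.

```python
import heapq
from itertools import count

# Subqueues from highest to lowest priority, following the order given for FIG. 19.
CLASSES = ("playback", "capture", "authoring", "service")


class DiskQueue:
    def __init__(self):
        self.subqueues = {name: [] for name in CLASSES}
        self._tie = count()  # tie-breaker so equal deadlines never compare payloads

    def add(self, client_class, deadline, request):
        entry = (deadline, next(self._tie), request)
        heapq.heappush(self.subqueues[client_class], entry)

    def pop_next(self):
        """Earliest-deadline request from the highest-priority non-empty subqueue."""
        for name in CLASSES:
            if self.subqueues[name]:
                deadline, _, request = heapq.heappop(self.subqueues[name])
                return name, deadline, request
        return None


q = DiskQueue()
q.add("authoring", 50, "read seg 7")
q.add("playback", 200, "read seg 3")
q.add("playback", 120, "read seg 9")

order = []
item = q.pop_next()
while item is not None:
    order.append(item[2])
    item = q.pop_next()
```

Note that the authoring request is served last even though its deadline value is the smallest, because the subqueue class takes precedence over the priority value within a subqueue.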

In one embodiment to be described in more detail below, 
the storage unit uses the network queue to prioritize network 
transfers in the process of scheduling those transfers. In this
embodiment, clients request transfer of data over the network.
If a storage unit receives two such requests at about
the same time, the storage unit processes the request that has
a higher priority in its network queue. For a read request, 
after the request is processed, the request is removed from 
the network queue. For a write request, the request is 
transferred from the network queue to the disk queue, with 
a priority depending on the availability of free buffers, after 
the transfer completes successfully. If the time has passed
for a request in the network queue to be processed, the
request may be dropped, since this indicates that the client is
no longer operating or did not request the network transfer in time.

Data transfers between the storage units and clients over 
the computer network may be scheduled to improve effi- 
ciency. In particular, scheduling data transfers improves 
bandwidth utilization of the computer network. Such sched- 
uling of the network usage should be performed particularly 
if the bandwidth of the link between a client and a switch is 
on the same order of magnitude as the bandwidth of the link 
between the storage unit and the switch. In particular, if the 
storage unit sends data and the client receives data at the link 
speed of their respective network connections, data is not 
likely to accumulate at a network switch or to experience 
other significant delays. 

In order to enforce such utilization of the network, a 
mechanism may be provided that forces each client to 
receive data from only one storage unit, and that forces each 
storage unit to send data to only one client, at any given time. 
For example, each client may have only one token. The 
client sends this token to only one storage unit to request 
transfer of the data for a selected segment. The token may 
indicate the deadline by which the data must be received by 
the client, i.e., the priority measure, and the specified seg- 
ment. Each storage unit sends data to only one client at a 
time, from which it has received a token. The storage unit 
only accepts one token at a time. After the data is transferred, 
the storage unit also returns the token. 
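
The token mechanism might be modeled as in the following Python sketch. It is an illustration under stated assumptions rather than the described implementation: the Client and StorageUnit classes, the dictionary used as a token, and the stubbed data string are all hypothetical.

```python
class StorageUnit:
    def __init__(self, name):
        self.name = name
        self.held_token = None  # the storage unit accepts only one token at a time

    def offer_token(self, token):
        """Accept the token only if no other client's token is currently held."""
        if self.held_token is not None:
            return False
        self.held_token = token
        return True

    def transfer_and_return(self):
        """Send the data (stubbed as a string) and hand the token back."""
        token, self.held_token = self.held_token, None
        data = f"data for segment {token['segment']}"
        return data, token


class Client:
    def __init__(self, name):
        # Each client owns exactly one token.
        self.token = {"owner": name, "segment": None, "deadline": None}

    def request(self, unit, segment, deadline):
        if self.token is None:
            return False  # token is already out, so no second transfer may start
        token = dict(self.token, segment=segment, deadline=deadline)
        if unit.offer_token(token):
            self.token = None
            return True
        return False

    def receive(self, unit):
        data, self.token = unit.transfer_and_return()
        return data


su = StorageUnit("su1")
alice, bob = Client("alice"), Client("bob")
first = alice.request(su, segment=12, deadline=500)
second = bob.request(su, segment=40, deadline=300)  # refused while su holds a token
data = alice.receive(su)
third = bob.request(su, segment=40, deadline=300)
```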

Another network scheduling process will now be 
described in connection with FIGS. 20 and 21. This process 
provides a similar result but does not use a token. Rather, a
client requests a communication channel with a storage unit,
specifying a segment and an amount of time E3 that the 
client is willing to wait for the transfer to occur. The client 
also may specify a new due time for the segment by which 
the client must receive the data. 

Referring now to FIG. 20, the client process for transfer- 
ring data over the network will now be described. At any 
point in time during the playback of a composition, each 
buffer has a segment of data associated with it and a time by 
which the data must be available in the buffer for continuous 
playback. As is known in the art, the application associates 
each of the buffers with a segment during the playback 
process. As shown above in connection with FIGS. 17 and 
18, each segment that a client has preread has an associated 
estimated time by which the data will be available at the 
storage unit. Accordingly, the client may order the buffers by 
their due time and whether the requested data is expected to 
be available in a buffer at the storage unit. This ordering may 
be used by the client to select a next buffer for which data 
will be transferred in step 500. The client requests a com- 
munication channel with the storage unit in step 502, speci- 
fying a waiting time E3. This value E3 may be short, e.g., 
100 milliseconds, if the client does not need the data
urgently and if the client may perform other operations more
efficiently. This value E3 may be longer if the client needs 
the data urgently, for example, so that it does not run out of 
data for one of its buffers. In step 504, the client receives a 
reply from the storage unit. If the storage unit indicates that 
the request is rejected, as determined in step 506, a revised
estimated time is received with the message in step 508. This
revised estimated time may be used to update the buffer list
in step 510 from which buffers are selected. Processing
returns to step 500 to select another buffer. A buffer for
which the segment is on the same storage unit as the
previously selected segment probably should not be 
selected. If the storage unit otherwise accepts the request, 
the data ultimately is received in step 518. 
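
The client-side steps 500 through 518 might be sketched as follows. This Python sketch makes several assumptions that are not in the text: buffers are plain dictionaries, request_channel is a hypothetical callable standing in for the network exchange, and the buffer order is computed once per pass rather than after every rejection.

```python
def client_read_pass(buffers, request_channel, now, wait_ms=100):
    """One pass over the buffer list; returns the segment received, or None.

    buffers: list of dicts with keys 'segment', 'unit', 'due', 'est_ready'.
    request_channel(unit, segment, wait_ms) returns ('accepted', data)
    or ('rejected', revised_estimated_time).
    """
    rejected_units = set()
    # Step 500: buffers whose data should already be available at the storage
    # unit come first, then earliest due time.
    ordered = sorted(buffers, key=lambda b: (b["est_ready"] > now, b["due"]))
    for buf in ordered:
        if buf["unit"] in rejected_units:
            continue  # avoid a unit that has just rejected a request
        status, payload = request_channel(buf["unit"], buf["segment"], wait_ms)  # 502, 504
        if status == "rejected":          # step 506
            buf["est_ready"] = payload    # steps 508 and 510
            rejected_units.add(buf["unit"])
            continue
        buf["data"] = payload             # step 518
        return buf["segment"]
    return None


def fake_channel(unit, segment, wait_ms):
    # Stand-in for the network: storage unit "su1" is always busy.
    if unit == "su1":
        return "rejected", 250
    return "accepted", b"frame-bytes"


bufs = [
    {"segment": 1, "unit": "su1", "due": 100, "est_ready": 0},
    {"segment": 2, "unit": "su1", "due": 150, "est_ready": 0},
    {"segment": 3, "unit": "su2", "due": 200, "est_ready": 0},
]
got = client_read_pass(bufs, fake_channel, now=10)
```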

The process from the point of view of the storage unit will 
now be described in connection with FIG. 21. The storage
unit receives a request from a client in step 520 indicating
waiting time E3. If the data is not yet available in the buffers
at that storage unit, as determined in step 522, the storage
unit rejects the request in step 524 and computes a revised
estimated time which is sent to the client. If the data is
otherwise available and the network connection of the
storage unit is not busy, as determined in step 526, then the
client becomes an "active client" and the communication
channel is granted by the storage unit in step 528, allowing
data to be transferred. If the network connection of the
storage unit is busy transferring data to another client, the
storage unit maintains a request from a "waiting client," to
which data is transferred after the data transfer for the
"active client" is completed. In order to determine whether
the current client should be the "waiting client," the storage
unit estimates a time by which the transfer could occur, in
step 530, based on the number of requests with earlier
deadlines in the network queue multiplied by the network
transmission time for each request. If the computed estimated
time of availability is greater than the waiting time
E3, indicating the client is not willing to wait that long, as
determined in step 532, the request is rejected in step 524.
Also, if the specified priority of this request is lower than the
priority for any current waiting client, as determined in step
534, the request is rejected in step 524. Otherwise, the
request from any current waiting client is rejected in step
536 and this new client is designated as the current waiting
client. When a transfer to the active client is completed, the
waiting client becomes the active client and the data is
transferred.
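
The storage unit's admission decision in steps 520 through 536 might be sketched as follows. The Python class below is a simplified model under stated assumptions: StorageUnitScheduler, the fixed per-request transfer time, the ready_in_ms table, and the earlier_in_queue argument (supplied by the caller instead of being derived from a real network queue) are all hypothetical. Priority is taken to be a deadline, so a later deadline means a lower priority.

```python
class StorageUnitScheduler:
    def __init__(self, transfer_ms):
        self.transfer_ms = transfer_ms  # assumed fixed network time per request
        self.active = None              # request currently being transferred
        self.waiting = None             # at most one waiting request
        self.ready_in_ms = {}           # segment -> ms until buffered (0 means ready)

    def handle(self, req, earlier_in_queue=0):
        """req: dict with 'client', 'segment', 'deadline', 'wait_ms'."""
        not_ready = self.ready_in_ms[req["segment"]]
        if not_ready > 0:                                   # step 522
            return "rejected", not_ready                    # step 524, revised estimate
        if self.active is None:                             # step 526
            self.active = req                               # step 528
            return "granted", None
        estimate = earlier_in_queue * self.transfer_ms      # step 530
        if estimate > req["wait_ms"]:                       # step 532
            return "rejected", estimate
        if self.waiting is not None and req["deadline"] > self.waiting["deadline"]:
            return "rejected", estimate                     # step 534
        displaced, self.waiting = self.waiting, req         # step 536
        return "waiting", displaced

    def transfer_done(self):
        """The waiting client, if any, becomes the active client."""
        self.active, self.waiting = self.waiting, None
        return self.active


su = StorageUnitScheduler(transfer_ms=40)
su.ready_in_ms = {1: 0, 2: 0, 3: 0, 4: 120}
r1 = su.handle({"client": "a", "segment": 1, "deadline": 500, "wait_ms": 100})
r2 = su.handle({"client": "b", "segment": 2, "deadline": 400, "wait_ms": 100}, 1)
r3 = su.handle({"client": "c", "segment": 3, "deadline": 450, "wait_ms": 100}, 1)
r4 = su.handle({"client": "d", "segment": 4, "deadline": 100, "wait_ms": 100})
next_active = su.transfer_done()
```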

In order to transfer data from a client to a storage unit, a 
similar process may be used for scheduling the network 
transfer and for transferring the data from a buffer in the 
storage unit to nonvolatile storage. From the point of view 
of the client, this process will now be described in connection
with FIG. 22. This process may be used to implement
steps 124 and 126 in FIG. 3.

Unlike the process of reading in which the client may 
place data into an arbitrary point within its set of buffers, the 
data to be transferred to a storage unit typically comes from
a read pointer from a set of buffers used by the capture 
system. 

The capture system typically produces one or more 
streams of video information as well as one or more streams 
of audio information. Accordingly, the capture system may
select one of the data streams according to the amount of
free buffer space in the stream to receive captured data. The
buffer at the current read pointer of the selected stream is
selected in step 600. A write request is then sent to the
storage unit in step 602. The request includes an identifier
for the segment, a due time or other priority value, and a
threshold E4 indicating an amount of time the client is
willing to wait. The due time is used by the storage unit to
prioritize network transfer requests. The threshold E4 is used
by the client, similar to threshold E3 discussed above, to
permit the client to efficiently schedule its own operations. 
The client, after sending the request to the storage unit,
eventually receives a reply in step 604. If the reply indicates
that the write request was rejected, as determined in step 
606, the reply includes an estimated time by which the 
storage unit will be available to receive the data. This 
estimated time may be used by the client to schedule other 
operations. If the storage unit accepts the request to write the 
data, the client then sends, in step 608, a portion of the 
segment of the data to the storage unit. A reply may be 
received in step 610 indicating whether or not the write 
request was successful, as analyzed in step 612. A failure 
may involve recovery processes in step 614. Otherwise the 
process is complete as indicated in step 616. 

From the point of view of the storage unit, the storage unit 
receives the write request from the client in step 620. The 
request indicates a due time or other priority stamp which is 
used to place the request within the network queue. The 
storage unit then determines in step 622 if a buffer is 
available for receiving the data. The storage unit may make 
such a buffer available. In the unlikely event that no buffers 
are available, the request may be rejected in step 624. 
Otherwise, a request is put in the network queue in step 626 
indicating the buffer allocated to receive the data, its priority 
stamp, and other information about the transfer. Next, the 
storage unit determines if the network connection is busy in 
step 628. If the network connection is not busy, the storage 
unit accepts the request in step 630 and sends a message to 
this effect to the client. The client then transfers the data 
which is received by the storage unit in step 632 and placed 
in the designated buffer. If the designated buffer is now full, 
as determined in step 634, the buffer is placed in the disk 
queue with an appropriate priority stamp in step 636. The 
storage unit's processing of its disk queue will eventually 
cause the data to be transferred from the buffer to permanent 
storage. Otherwise, the storage unit waits until the client 
sends enough data to fill the buffer as indicated in step 638. 
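
The buffer handling on the write path (steps 622 through 638) might be sketched as follows. This Python model is a simplification under stated assumptions: the WriteBuffer and WriteReceiver classes and the status strings are hypothetical, the queues are plain lists without priority ordering, and the busy-network check of step 628 is omitted.

```python
class WriteBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray()

    def is_full(self):
        return len(self.data) >= self.capacity


class WriteReceiver:
    def __init__(self, n_buffers, buffer_size):
        self.free = [WriteBuffer(buffer_size) for _ in range(n_buffers)]
        self.network_queue = []  # (priority, segment, buffer) awaiting client data
        self.disk_queue = []     # (priority, segment, buffer) awaiting write to disk

    def accept(self, segment, priority):
        if not self.free:                                     # step 622
            return None                                       # step 624: reject
        buf = self.free.pop()
        self.network_queue.append((priority, segment, buf))   # step 626
        return buf                                            # step 630

    def receive(self, segment, chunk):
        for entry in self.network_queue:
            if entry[1] == segment:
                entry[2].data.extend(chunk)                   # step 632
                if entry[2].is_full():                        # step 634
                    self.network_queue.remove(entry)
                    self.disk_queue.append(entry)             # step 636
                    return "queued for disk"
                return "waiting for more data"                # step 638
        raise KeyError(segment)


rx = WriteReceiver(n_buffers=1, buffer_size=8)
buf = rx.accept(segment=5, priority=300)
s1 = rx.receive(5, b"abcd")
s2 = rx.receive(5, b"efgh")
rejected = rx.accept(segment=6, priority=100)  # no free buffer remains
```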

If the network connection of the storage unit is busy, as 
determined in step 628, the storage unit computes, in step 
640, an estimated time by which the network connection of 
the storage unit should be available. If this computed time is 
greater than the indicated waiting time E4, as determined in 
step 642, the request is rejected in step 624 with an estimate 
of the time of availability of the storage unit. If the storage 
unit expects to be able to transfer the data within the waiting 
time E4 indicated by a client, the storage unit compares the 
priority of the request with the priority of a request for any 
currently waiting client, in step 644. If this request is of a 
lower priority than the request of the currently waiting 
client, the request is rejected. Otherwise, the request from 
the currently waiting client is rejected, and this new request 
is made the next request to be processed in step 646. 

Additional embodiments for use when the redundancy 
information is created from two or more segments will now 
be described in connection with FIGS. 24 and 25. 
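
As orientation, the exclusive-or idea underlying these embodiments can be sketched in a few lines of Python. This is an illustrative sketch and not the described implementation: the function names are hypothetical, segments are assumed to be byte strings of equal length, and an empty accumulator is represented by None where the text speaks of a memory reset to zeros.

```python
def xor_bytes(a, b):
    """Byte-wise exclusive-or of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def build_redundancy(segments, set_size):
    """Yield one exclusive-or block per redundancy set of up to `set_size` segments."""
    parity, counter = None, 0
    for seg in segments:
        parity = seg if parity is None else xor_bytes(parity, seg)
        counter += 1
        if counter == set_size:
            yield parity
            parity, counter = None, 0
    if parity is not None:  # flush a partial set when capture ends
        yield parity


def recover(remaining_segments, parity):
    """Rebuild the single lost segment of a set from the survivors and the parity."""
    lost = parity
    for seg in remaining_segments:
        lost = xor_bytes(lost, seg)
    return lost


segments = [b"AAAA", b"BBBB", b"CCCC"]
(parity,) = build_redundancy(segments, set_size=3)
rebuilt = recover([segments[0], segments[2]], parity)
```

Recovery works only when at most one member of a redundancy set is lost, which is why each segment of a set should reside on a different storage unit.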

Referring now to FIG. 24, an example process for storing 
segments of data with redundancy information in a ran- 
domly distributed manner over several storage units will 
now be described in more detail. This process is generally 
similar to the process described above in connection with 
FIG. 3. First, in step 700, the capturing system creates a 
segment table 90B (FIG. 2B). An image index, which maps
each image to an offset in the stream of data to be captured,
also typically is created. The indexed images may corre- 
spond to, for example, fields or frames of the video. The 
index may refer to other sample boundaries, such as a period 
of time, for other kinds of data, such as audio. The capturing 
system also obtains a list of available storage units, as 
described above. The capturing system also receives an
indication of a redundancy set size, either automatically
based on the list of available storage units or from a user. In
general, the redundancy set size should be less than the
number of available storage units, and may be a significantly
smaller subset. A counter is also used to keep track of which
segments are in a given redundancy set. This counter is reset
to zero in step 700. An exclusive-or memory is also used,
which is reset to all binary unasserted values, e.g., "0."

A segment of data is then created by the capturing system
in step 720. An appropriate size for this segment was
discussed above in connection with the description of FIG.
3. The counter is also incremented in step 720.

The current segment is stored locally as an exclusive-or of
any segment already stored in the exclusive-or memory, in
step 722. A storage unit is selected for the segment in step
724. Selection of the storage unit for a segment is random or
pseudorandom. This selection may be independent of the
selection made for any previous redundancy set. However,
the selection should ensure that each segment in a redundancy
set is stored on a different storage unit. Each file may
use only a subset of the available storage units as discussed
above in connection with the description of FIG. 3.

After a storage unit is selected for the segment, the
segment is sent to the storage unit in step 726 for storage.
The capture system then may wait for the storage unit to
acknowledge completion of storage of the segment in step
728. When data must be stored in real-time while being
captured, the data transfer in step 726 may occur in two
steps, similar to read operations, as discussed above. After
the data is successfully stored on the storage units, the
segment table 90B is updated by the capturing system in step
730.

If the counter is currently equal to the redundancy set size,
as determined in step 732, the contents of the local
exclusive-or memory is the redundancy information. This
redundancy information is then stored on the storage units.
In particular, the counter is reset in step 734. A storage unit
is selected for the redundancy information in step 736. The
redundancy information is sent to the selected storage unit in
step 738. The capturing system then waits for acknowledgment
of successful storage in step 740. The segment table
may then be updated in step 742.

If capture is complete, as determined in step 128, then the
process terminates; at this time any redundancy information
stored in the exclusive-or memory should be stored in a
storage unit in step 745, using a procedure similar to step
734 through 742. The updated segment table is then sent to
the catalog manager in step 746. If the counter is not equal
to the redundancy set size in step 732, and if capturing is not
complete as determined in step 744, process continues by
creating the next segment of data and incrementing the
counter in step 720.

As discussed above in connection with FIG. 5, the redundancy
information allows data to be recovered if one of the
storage units has failed. FIG. 25 illustrates a process for
performing such failure recovery when the redundancy
information is based on a redundancy set containing two or
more segments. As in FIG. 5, a file to be recovered is
selected in step 750. Any lost segments of that file are
identified in step 752. The redundancy set containing a lost
segment is then read in step 754. This step involves reading
the redundancy information for the set created by exclusive-or
of the segments in the set, and reading the remaining
segments of the redundancy set. An exclusive-or of the
remaining segments and the redundancy information is then
computed in step 756 to reconstruct the lost segment. A
storage unit for each reconstructed lost segment is then
selected in step 758, similar to step 204 in FIG. 5. The
reconstructed lost segments are stored in the selected storage
units. The segment table is updated upon successful completion
of the storage operations. The updated segment table is
then sent to the catalog manager in step 762.

It is also possible to convert a file having one kind of
redundancy information, e.g., a copy of the segment, to
another kind of redundancy information, e.g., an exclusive-or
of two or more segments. For example, an additional copy
of the data may be made using the process shown in FIG. 6.
After this process is completed, the other form of redundancy
information (the exclusive-or results of segments)
may be deleted. Similarly, the process shown in FIG. 24 may
be used with stored data to create exclusive-or redundancy
information. After creation of such information, any extra
copy of data may be deleted using the process shown in FIG.
7. The form in which a file has redundancy information may
vary from file to file and may be based on, for example, a
priority associated with the file and an indication of the form
of the redundancy information may be stored in the catalog
manager.

By scheduling data transfers over the network and by
distributing the load on the storage units with selected access
to randomly distributed segments of data with redundancy
information, this system is capable of efficiently transferring
multiple streams of data in both directions between multiple
applications and multiple storage units in a highly scalable
and reliable manner, which is particularly beneficial for
distributed multimedia production.

One application that may be implemented using such a
computer network is the capability to send and return
multiple streams to other external digital effects systems that
are commonly used in live production. These systems may
be complex and costly. Most disk-based nonlinear video
editing systems have disk subsystems and bus architectures
which cannot sustain multiple playback streams while
simultaneously recording an effects return stream, which
limits their abilities to be used in an online environment.
Using this system, several streams may be sent to an effects
system, which outputs an effects data stream to be stored on
the multiple storage units. The several streams could be
multiple camera sources or layers for dual digital video
effects.

It is also possible to have multiple storage units providing
data to one client to satisfy a client's need for a high
bandwidth stream of data that has a higher bandwidth than
any one storage unit. For example, if each of twenty storage
units had a 10 MB/s link to a switch and a client had a 200
MB/s link to the switch, the client could read 200 MB/s from
twenty storage units simultaneously, permitting transfer of a
data stream for high definition television (HDTV), for
example.

Using the procedures outlined above, storage units and
clients operate using local information and without central
configuration management or control. A storage unit may be
added to the system during operation without requiring the
system to be shut down. The storage unit simply starts
operation, informs clients of its availability, and then establishes
processes to respond to access requests. This expandability
complements the capability and reliability of the
system.

Having now described a few embodiments, it should be
apparent to those skilled in the art that the foregoing is
merely illustrative and not limiting, having been presented
by way of example only. Numerous modifications and other
embodiments are within the scope of one of ordinary skill in
the art and are contemplated as falling within the scope of
the appended claims and equivalents thereto.

What is claimed is:

1. A file system for a computer, enabling the computer to 
access remote independent storage units over a computer 
network in response to a request, from an application 
executed on the computer, to read data stored in a file on the 
storage units, wherein a file includes segments of the data 
and corresponding redundancy information for each 
segment, and wherein, for each file, each segment of the data 
is stored on a randomly or pseudorandomly selected one of 
the storage units, and wherein, for each segment of the data, 
the corresponding redundancy information is stored on a 
randomly or pseudorandomly selected one of the storage 
units, the file system comprising: 

means, responsive to the request to read data from a file,
for selecting, for each segment of the requested data,
one of the storage units on which data representing the
segment is stored;

means for reading each segment of the requested data
from the selected storage unit for the segment;

means for serializing the segments read from the selected
storage units; and

means for providing the serialized data to the application;

wherein the means for selecting includes:
in the file system:

means for requesting data from one of the storage
units, indicating an estimated time;

means for requesting data from another of the storage
units, indicating an estimated time, if the first
storage unit rejects the request; and

means for requesting the data from the first storage
unit if the second storage unit rejects the request;
and

in each storage unit:

means for rejecting a request for data if the request
cannot be serviced by the storage unit within the
estimated time; and

means for accepting a request for data if the request
can be serviced by the storage unit within the
estimated time.

2. A file system for a computer, enabling the computer to 
access remote independent storage units over a computer 
network in response to a request, from an application 
executed on the computer, to read data stored in a file on the 
storage units, wherein a file includes segments of the data 
and corresponding redundancy information for each 
segment, and wherein, for each file, each segment of the data 
is stored on a randomly or pseudorandomly selected one of 
the storage units, and wherein, for each segment of the data, 
the corresponding redundancy information is stored on a 
randomly or pseudorandomly selected one of the storage 
units, the file system comprising: 

means, responsive to the request to read data from a file, 
for selecting, for each segment of the requested data, 
one of the storage units on which data representing the 
segment is stored; 

means for reading each segment of the requested data 
from the selected storage unit for the segment; 

means for serializing the segments read from the selected 
storage units; and means for providing the serialized 
data to the application; 

wherein the means for reading each segment comprises 
means for scheduling the transfer of the data from the 
selected storage unit such that the storage unit effi- 
ciently transfers data, and includes: 
in the file system: 

means for requesting transfer of the data from the 
selected storage unit, indicating a waiting time;

means for requesting the data from another storage
unit if the selected storage unit rejects the request 
to transfer the data; and 
in the storage unit: 

means for rejecting a request to transfer data if the 
data is not available to be transferred from the 
storage unit by the indicated waiting time; and 

means for transferring the data if the selected storage 
unit is able to transfer the data within the waiting 
time. 

3. A file system for a computer, enabling the computer to 
access remote independent storage units over a computer 
network in response to a request, from an application 
executed on the computer, to read data stored on the storage 
units, wherein segments of the data and corresponding 
redundancy information are randomly or pseudorandomly 
distributed among the plurality of storage units, the file 
system comprising: 

means, responsive to the request to read data, for 
selecting, for each segment of the requested data, one 
of the storage units on which data representing the 
segment is stored; 

means for transferring the data from the selected storage 
unit if the selected storage unit has the data available 
within an indicated waiting time; 

means for transferring the data from another storage unit 
if the selected storage unit does not have the data 
available within the indicated waiting time; and 

means for providing the transferred data to the applica- 
tion; 

wherein the means for selecting comprises: 

means for selecting a first one of the storage units if the 
first storage unit can transfer the segment within a 
first estimated time, and for selecting a second one of 
the storage units if the second one of the storage units 
can transfer the segment within a second estimated 
time and if the first storage unit cannot transfer the 
segment within the first estimated time, and for 
selecting the first storage unit if the second storage 
unit cannot transfer the segment within the second 
estimated time. 

4. The file system of claim 3, wherein the redundancy 
information corresponding to a segment is a copy of the 
segment. 

5. A file system for a computer, enabling the computer to 
access remote independent storage units over a computer 
network in response to a request, from an application 
executed on the computer, to read data stored in a file on the 
storage units, wherein a file includes segments of the data 
and corresponding redundancy information for each 
segment, and wherein, for each file, each segment of the data 
is stored on a randomly or pseudorandomly selected one of 
the storage units, and wherein, for each segment of the data, 
the corresponding redundancy information is stored on a 
randomly or pseudorandomly selected one of the storage 
units, the file system comprising: 

means, responsive to the request to read data from a file, 
for selecting, for each segment of the requested data, 
one of the storage units on which data representing the 
segment is stored; 

means for reading each segment of the requested data 
from the selected storage unit for the segment; 

means for serializing the segments read from the selected 
storage units; and means for providing the serialized 
data to the application;

wherein the means for selecting comprises:

means for selecting a first one of the storage units if the 
first storage unit can transfer the segment within a 
first estimated time, and for selecting a second one of 
the storage units if the second one of the storage units
can transfer the segment within a second estimated
time and if the first storage unit cannot transfer the
segment within the first estimated time, and for
selecting the first storage unit if the second storage
unit cannot transfer the segment within the second
estimated time.

6. A file system for a computer, enabling the computer to
access remote independent storage units over a computer
network in response to a request, from an application
executed on the computer, to read data stored on the storage
units, wherein segments of the data and corresponding 
redundancy information are randomly or pseudorandomly 
distributed among the plurality of storage units, wherein the 
redundancy information corresponding to a segment is a 
copy of the segment, the file system comprising:

means for selecting for each segment of the requested data
one of the storage units on which the segment is stored, 
by selecting a first one of the storage units if the first 
storage unit can transfer the segment within a first 
estimated time, and by selecting a second one of the 
storage units if the second one of the storage units can 
transfer the segment within a second estimated time 
and if the first storage unit cannot transfer the segment 
within the first estimated time, and selecting the first 
storage unit if the second storage unit cannot transfer 
the segment within the second estimated time; 

means for reading each segment of the requested data 
from the selected storage unit for the segment; and 

means for providing the data to the application when the 
data is received from the identified storage units. 

7. The file system of claim 6, wherein the redundancy 
information corresponding to a segment is a copy of the 
segment. 

* * * * *